pith. machine review for the scientific record.

arxiv: 2602.13215 · v2 · submitted 2026-01-22 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 11:46 UTC · model grok-4.3

classification 💻 cs.AI
keywords AMOR · adaptive entropy gate · hybrid recurrent-attention models · Mamba2 · Gated DeltaNet · uncertainty routing · long-context performance · System 1 / System 2

The pith

AMOR routes attention to high-entropy tokens in recurrent models, matching full hybrids while using attention on only about 22 percent of positions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AMOR as a post-hoc addition to recurrent backbones that activates attention blocks only when output entropy exceeds a dynamic threshold set from the running batch median plus scaled standard deviation. This creates a gradient-free decision rule for when to supplement recurrence with attention. Experiments across Mamba2 and Gated DeltaNet models sized 180M to 1.5B show that the selective routing reaches or exceeds the accuracy of both pure recurrent models and fixed-schedule hybrids. The same models preserve stable performance on LongBench long-context tasks where prior hybrids lose ground under distribution shift. The result frames attention use as a matter of timing rather than uniform application.

Core claim

AMOR augments a recurrent backbone with entropy-gated attention blocks that activate only when the model's output entropy exceeds a dynamic threshold derived from the running batch median and scaled standard deviation. This gradient-free router, inspired by uncertainty-driven computation and the System 1/System 2 distinction, is applied after the recurrent state has already been computed. Across Mamba2 and Gated DeltaNet backbones from 180M to 1.5B parameters the method matches or outperforms both pure recurrent baselines and fixed hybrid schedules while invoking attention on roughly 22 percent of tokens and maintaining stable long-context results on LongBench.

What carries the argument

The entropy gate: a threshold comparator that checks model output entropy against a running batch median plus scaled standard deviation to decide whether to invoke the attention block at that position.
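A minimal sketch of that comparator, assuming the gate reads the backbone's per-token output distribution. The function name, the running-statistics pool, and the scaling coefficient alpha are placeholders; the paper's exact coefficient and windowing scheme are not reproduced here.

```python
import torch

def entropy_gate(logits, running_entropy, alpha=1.0):
    """Entropy-gated routing decision (illustrative sketch, not the paper's code).

    logits:          (batch, seq, vocab) outputs of the recurrent backbone
    running_entropy: 1-D tensor of recently observed token entropies
    alpha:           scaling coefficient on the standard deviation (assumed value)
    """
    probs = torch.softmax(logits, dim=-1)
    # Per-position predictive entropy of the backbone's output distribution.
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)  # (batch, seq)

    # Dynamic threshold from running batch statistics: median + alpha * std.
    pool = torch.cat([running_entropy, entropy.flatten()])
    threshold = pool.median() + alpha * pool.std()

    fire = entropy > threshold  # True where the attention block is invoked
    return fire, entropy, threshold
```

Positions where fire is true would be handed to the attention block; every other position keeps the recurrent prediction.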

If this is right

  • Hybrid models reach full performance with attention applied to only about one-fifth of tokens.
  • Long-context stability improves because attention is reserved for positions where recurrence alone is uncertain.
  • No per-task retuning or gradient-based routing is required beyond the fixed entropy rule.
  • Common-sense reasoning tasks benefit from the same selective allocation pattern.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Uncertainty measured at the output may serve as a general signal for routing compute between light and heavy modules in other sequence models.
  • The median-plus-deviation rule could be tested as a drop-in replacement for learned routers in mixture-of-experts architectures (a toy sketch follows this list).
  • If the threshold proves stable across domains it would allow deployment-time scaling of attention use without retraining.
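Carrying the second bullet one step further, here is a toy sketch of the median-plus-deviation rule wired into a two-expert layer. The class, both expert modules (pointwise, MLP-style experts rather than attention), and alpha are invented for illustration; nothing below comes from the paper or from any existing mixture-of-experts implementation.

```python
import torch
import torch.nn as nn

class EntropyThresholdRouter(nn.Module):
    """Hypothetical gradient-free two-expert router using the median-plus-deviation
    rule. Illustrative only; not the paper's method and not a tested MoE design."""

    def __init__(self, light: nn.Module, heavy: nn.Module, alpha: float = 1.0):
        super().__init__()
        self.light = light    # cheap expert producing per-token logits
        self.heavy = heavy    # expensive pointwise expert producing per-token logits
        self.alpha = alpha

    def forward(self, x):
        light_out = self.light(x)                              # (batch, seq, vocab)
        probs = torch.softmax(light_out, dim=-1)
        entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)

        # Batch-level dynamic threshold; no learned router parameters.
        threshold = entropy.median() + self.alpha * entropy.std()
        route_heavy = entropy > threshold                      # (batch, seq) bool

        out = light_out.clone()
        if route_heavy.any():
            # Recompute only the uncertain positions with the heavy expert.
            # (A pointwise expert; an attention expert would also need sequence context.)
            out[route_heavy] = self.heavy(x[route_heavy])
        return out, route_heavy
```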

Load-bearing premise

The batch-median-plus-scaled-deviation threshold on output entropy correctly identifies the positions where recurrent state alone is insufficient for accurate prediction.

What would settle it

A direct measurement showing that positions above the entropy threshold are not those where the recurrent prediction errs while attention would correct it, or a benchmark where AMOR falls below pure recurrent performance despite the selective routing.
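That measurement could be operationalized on held-out labeled data by checking how often gate firings coincide with recurrent-only prediction errors. A sketch, with all names and shapes assumed:

```python
import torch

def gate_error_alignment(entropy, threshold, recurrent_preds, targets):
    """How well does 'entropy above threshold' coincide with positions where the
    recurrent-only prediction errs? (Illustrative evaluation sketch.)

    entropy:         (N,) per-token output entropy of the recurrent backbone
    threshold:       scalar dynamic threshold (median + alpha * std)
    recurrent_preds: (N,) argmax predictions of the recurrent-only model
    targets:         (N,) ground-truth next tokens on held-out data
    """
    fired = entropy > threshold
    errors = recurrent_preds != targets

    tp = (fired & errors).sum().float()
    precision = tp / fired.sum().clamp_min(1)  # fired positions that were real errors
    recall = tp / errors.sum().clamp_min(1)    # errors that the gate caught
    return precision.item(), recall.item()
```

High precision and recall would support the load-bearing premise; low recall would mean the gate misses positions where attention could have corrected the recurrent prediction.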

Figures

Figures reproduced from arXiv: 2602.13215 by Chen Shani, Haoran Zheng.

Figure 1. AMOR architecture. An SSM (System 1) processes each token, producing predictions and hidden states. The entropy gate monitors prediction entropy: high entropy ("I don't know") triggers sparse attention (System 2) over the Ghost KV cache, keys and values projected from SSM hidden states that provide temporally aware representations for retrieval.
Figure 2. Entropy distribution at local vs retrieval positions. The clear bimodal separation validates …
Figure 3. Entropy gap evolution during training. The model learns to be uncertain at retrieval positions.
Figure 4. SSM state decay vs noise length. Retrieval accuracy drops sharply as noise length increases.
Figure 5. Gate firing pattern on an example sequence. Red indicates positions where the gate fires (attention engaged) …
Figure 6. Training dynamics showing accuracy, gate metrics, retrieval vs local accuracy, and loss.
read the original abstract

Recurrent-attention hybrids aim to combine the efficiency of recurrence with the expressivity of attention, but existing approaches typically apply attention uniformly across all positions, even when the recurrent state alone is sufficient for accurate prediction. We introduce AMOR (Adaptive Metacognitive Output Router), a post-hoc hybrid architecture that selectively invokes attention based on predictive uncertainty. A recurrent backbone is augmented with entropy-gated attention blocks that activate only when the model's output entropy exceeds a dynamic threshold derived from a running batch median and scaled standard deviation. This yields a simple, gradient-free routing mechanism inspired by uncertainty-driven computation and the System 1 / System 2 distinction. Across Mamba2 and Gated DeltaNet backbones (180M-1.5B), AMOR consistently matches or outperforms both pure recurrent models and fixed-schedule hybrid baselines while invoking attention on only ~22% of tokens. It achieves strong performance on common-sense reasoning benchmarks and maintains stable long-context performance on LongBench, where prior hybrid models degrade under distribution shift. These results suggest that when attention is applied matters as much as how much: selectively allocating attention based on predictive uncertainty improves both efficiency and robustness, offering a simple alternative to uniform or fixed routing strategies and pointing toward adaptive hybrid architectures that dynamically match computation to input difficulty.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces AMOR (Adaptive Metacognitive Output Router), a post-hoc hybrid architecture for recurrent models (Mamba2, Gated DeltaNet) that selectively routes tokens to attention blocks when the recurrent backbone's output entropy exceeds a dynamic threshold computed from the running batch median plus a scaled standard deviation. The central claim is that this uncertainty-driven, gradient-free mechanism matches or exceeds both pure recurrent baselines and fixed-schedule hybrids while invoking attention on only ~22% of tokens, with stable performance on common-sense reasoning tasks and LongBench under distribution shift.

Significance. If the routing decisions prove reliable, AMOR offers a lightweight alternative to uniform or fixed hybrid schedules by tying attention invocation directly to predictive uncertainty. This could improve both computational efficiency and robustness in long-context settings, providing a simple empirical route toward adaptive System-1/System-2 hybrids without requiring end-to-end optimization of the router.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): the reported ~22% attention invocation rate and benchmark gains are presented without error bars, ablation tables on the scaling factor in the threshold, or verification that the batch-median statistic remains stable across different batch compositions and sequence lengths. This directly affects the load-bearing claim that the dynamic threshold reliably identifies positions where recurrent state is insufficient.
  2. [§3] §3 (AMOR Architecture): the entropy threshold is defined heuristically from observed batch statistics rather than derived or optimized against a downstream loss or error metric. No analysis is provided showing that high-entropy tokens actually correspond to positions where recurrent predictions diverge from ground truth, leaving the correlation between the proxy and actual insufficiency untested.
  3. [§4.2] §4.2 (LongBench Evaluation): the stability claim under distribution shift is supported only by aggregate scores; no per-task breakdown or comparison of routing decisions before/after shift is shown, so it is unclear whether the ~22% rate and performance retention result from correct adaptive routing or from incidental architecture effects.
minor comments (2)
  1. [§3] The exact scaling coefficient applied to the standard deviation in the threshold formula should be stated explicitly (including its value and any sensitivity analysis) rather than described only qualitatively.
  2. [Figures and Tables] Figure captions and tables should include the precise backbone sizes (180M–1.5B) and the number of runs used to compute the reported metrics.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major point below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): the reported ~22% attention invocation rate and benchmark gains are presented without error bars, ablation tables on the scaling factor in the threshold, or verification that the batch-median statistic remains stable across different batch compositions and sequence lengths. This directly affects the load-bearing claim that the dynamic threshold reliably identifies positions where recurrent state is insufficient.

    Authors: We agree that including error bars and ablations would strengthen the claims. In the revised version, we will add error bars (standard deviations over 3 seeds) to the main results tables. We will also include an ablation study on the scaling factor (values 0.5, 1.0, 2.0) showing its impact on attention invocation rate and performance. For the batch-median stability, we have run additional experiments with varying batch sizes (4 to 32) and sequence lengths, confirming low variance in the threshold; these results will be added to §4 (a minimal sketch of such a sweep appears after these responses). revision: yes

  2. Referee: [§3] §3 (AMOR Architecture): the entropy threshold is defined heuristically from observed batch statistics rather than derived or optimized against a downstream loss or error metric. No analysis is provided showing that high-entropy tokens actually correspond to positions where recurrent predictions diverge from ground truth, leaving the correlation between the proxy and actual insufficiency untested.

    Authors: The heuristic design is deliberate to keep AMOR training-free and applicable post-hoc. Deriving it from a loss would require end-to-end optimization, changing the method's nature. We will expand §3 to better motivate the choice of batch median + scaled std as a robust, distribution-free statistic. However, we cannot provide direct analysis correlating entropy with ground-truth errors without access to labels at test time, which would violate the unsupervised setting. This limitation will be explicitly stated. revision: partial

  3. Referee: [§4.2] §4.2 (LongBench Evaluation): the stability claim under distribution shift is supported only by aggregate scores; no per-task breakdown or comparison of routing decisions before/after shift is shown, so it is unclear whether the ~22% rate and performance retention result from correct adaptive routing or from incidental architecture effects.

    Authors: We will revise §4.2 to include a per-task performance table for LongBench. We will also add a new figure showing attention invocation rates per task before and after the distribution shift, demonstrating that the routing adapts consistently rather than relying on incidental effects. revision: yes
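The ablation promised in response 1 amounts to a small sweep. A minimal sketch, assuming per-batch entropy vectors from the backbone have already been collected; the alpha values mirror the rebuttal, and everything else (names, reporting format) is a placeholder.

```python
import torch

def threshold_stability_sweep(entropy_per_batch, alphas=(0.5, 1.0, 2.0)):
    """For each scaling factor alpha, compute the dynamic threshold
    (median + alpha * std) per batch and summarize its spread and the
    resulting attention invocation rate. Illustrative sketch only.

    entropy_per_batch: list of 1-D tensors, one per batch composition
                       (e.g. different batch sizes or sequence lengths)
    """
    report = {}
    for alpha in alphas:
        thresholds, rates = [], []
        for ent in entropy_per_batch:
            thr = ent.median() + alpha * ent.std()
            thresholds.append(thr)
            rates.append((ent > thr).float().mean())
        thresholds = torch.stack(thresholds)
        rates = torch.stack(rates)
        report[alpha] = {
            "threshold_mean": thresholds.mean().item(),
            "threshold_std": thresholds.std().item(),  # small spread suggests a stable threshold
            "invocation_rate": rates.mean().item(),    # fraction of positions gated to attention
        }
    return report
```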

standing simulated objections not resolved
  • A direct test of whether high-entropy tokens correspond to positions of recurrent prediction error; the simulated authors decline on the grounds that this requires ground-truth labels unavailable during inference.

Circularity Check

0 steps flagged

No significant circularity detected in AMOR derivation

full rationale

The paper presents AMOR as a post-hoc, gradient-free empirical routing mechanism whose dynamic threshold is computed directly from running batch statistics (median and scaled std dev of output entropy) on the recurrent backbone outputs. No equations, predictions, or first-principles derivations are shown that reduce by construction to fitted parameters or self-citations; the central claims rest on empirical performance comparisons rather than any self-referential loop. The method is explicitly described as an addition applied after the backbone, with no load-bearing self-citation chains or ansatz smuggling identified in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that batch-level entropy statistics form a stable, task-agnostic proxy for when recurrence is insufficient. No free parameters are explicitly fitted in the abstract description; the scaling of the standard deviation is left implicit. No new physical or mathematical entities are introduced.

axioms (2)
  • domain assumption Output entropy is a sufficient statistic for deciding when recurrent state alone is inadequate for accurate prediction.
    Invoked in the description of the entropy-gated activation rule.
  • domain assumption Batch median and scaled standard deviation provide a reliable, gradient-free dynamic threshold that generalizes across inputs.
    Core of the routing mechanism stated in the abstract.

pith-pipeline@v0.9.0 · 5524 in / 1329 out tokens · 48353 ms · 2026-05-16T11:46:11.876477+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 10 internal anchors

  1. Andrea Banino, Jan Balaguer, and Charles Blundell. PonderNet: Learning to Ponder. arXiv:2107.05407.
  2. Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The Long-Document Transformer. arXiv:2004.05150.
  3. Yoshua Bengio. The Consciousness Prior. arXiv:1709.08568.
  4. Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arXiv:1308.3432.
  5. Tri Dao and Albert Gu. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality. arXiv:2405.21060.
  6. Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, et al. Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv:2402.19427.
  7. Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Yingyan Lin, Jan Kautz, and Pavlo Molchanov. Hymba: A Hybrid-Head Architecture for Small Language Models. arXiv:2411.13676.
  8. Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A Compact 7B SSM Hybrid Model. arXiv:2405.16712.
  9. Alex Graves. Adaptive Computation Time for Recurrent Neural Networks. arXiv:1603.08983.
  10. Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
  11. Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, and Yoav Shoham. Jamba: A Hybrid Transformer-Mamba Language Model. arXiv:2403.19887.
  12. Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Huanqi Cao, Xin Cheng, Michael Chung, Matteo Grella, et al. RWKV: Reinventing RNNs for the Transformer Era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077.
  13. David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro. Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models. arXiv:2404.02258.
  14. Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive Network: A Successor to Transformer for Large Language Models. arXiv:2307.08621.
  15. Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical Reasoning Model. arXiv:2506.21734.
  16. Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-Attention with Linear Complexity. arXiv:2006.04768.
  17. Riccardo Zattra, Giacomo Baggio, Umberto Casti, Augusto Ferrante, and Francesco Ticozzi. Context-Selective State Space Models: Feedback Is All You Need. arXiv:2510.14027.