Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
Pith reviewed 2026-05-09 19:06 UTC · model grok-4.3
The pith
Mixture-of-experts routing succeeds at domain transitions when the gate adds temporal memory, precision weighting, and anticipation, three mechanisms inspired by the Free Energy Principle.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Affinity routing alone cannot solve expert selection across domain transitions because it lacks the state needed to detect approaching changes. Three gate additions supply it: a per-expert LIF membrane potential (beta) that accumulates routing context, precision-weighted gating (Pi) that amplifies reliable experts, and a next-state predictor conditioned on the accumulated state. Together they raise the probability assigned to the correct expert at the transition point from 0.006 to 0.748 in synthetic experiments and from 0.42 to 0.86 in a character-level language model, while cutting transition-step bits per character from 6.56 to 4.01.
What carries the argument
The combination of a per-expert LIF membrane potential (beta) for temporal memory, precision-weighted gating (Pi) using inverse variance of prediction errors, and anticipatory routing that predicts the next hidden state from the accumulated context.
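As a concrete sketch, the three mechanisms compose into one small stateful gate. The class below is illustrative only — the names, shapes, initialization, and the exact way the anticipatory logits combine with the membrane potential are assumptions, not the paper's released reference implementation.

```python
import numpy as np

class StatefulGate:
    """Hypothetical sketch of the three gate modifications: LIF temporal
    memory (beta), precision weighting (Pi), and anticipatory routing."""

    def __init__(self, n_experts, d_model, beta=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.beta = beta                              # LIF decay factor
        self.u = np.zeros(n_experts)                  # per-expert membrane potential
        self.h = np.zeros(d_model)                    # beta-accumulated hidden state
        self.err = [[1.0] for _ in range(n_experts)]  # recent per-expert prediction errors
        self.w_gate = rng.standard_normal((d_model, n_experts)) * 0.1
        self.w_pred = rng.standard_normal((d_model, d_model)) * 0.1  # next-state predictor

    def step(self, x):
        affinity = x @ self.w_gate                    # standard stateless affinity logits
        self.u = self.beta * self.u + affinity        # temporal memory (beta)
        self.h = self.beta * self.h + x               # accumulated context for anticipation
        ant = np.tanh(self.h @ self.w_pred) @ self.w_gate  # score the predicted next state
        pi = np.array([1.0 / (np.var(e) + 1e-6) for e in self.err])
        pi = pi / pi.mean()                           # precision weighting (Pi), normalized
        logits = pi * (self.u + ant)
        z = np.exp(logits - logits.max())
        return z / z.sum()                            # routing distribution over experts

    def record_error(self, expert, err, window=32):
        self.err[expert] = (self.err[expert] + [err])[-window:]
```

Because `u` and `h` carry across calls, two identical inputs can route differently, which is exactly the property a stateless affinity gate lacks.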
If this is right
- MoE models require only a small fixed number of experts for reliable coverage across many domains once the gate carries temporal state.
- Routing decisions become predictive, allowing expert activation before the new domain tokens arrive.
- The super-additive interaction between temporal memory and anticipation shows that stateless predictors fail precisely because pre-transition tokens are distributionally identical to within-domain tokens.
- The three modifications can be implemented in roughly 200 lines and require no change to the experts themselves.
Where Pith is reading between the lines
- The same stateful gate changes could improve robustness in other sparse architectures that route on affinity alone.
- Applying the modifications at the scale of current large language models would test whether the gains persist when domain shifts arise from topic, style, or language changes rather than synthetic switches.
- The approach embeds simple predictive-coding dynamics inside the router without requiring full variational inference over the entire model.
Load-bearing premise
The controlled four-expert synthetic transitions and the character-level language model setup capture the distribution-shift problems that occur in large-scale deployed MoE models.
What would settle it
Run the modified gate on a large-scale MoE language model trained on heterogeneous web text and measure whether the probability assigned to the correct domain expert at natural domain-switch points rises by a comparable factor.
read the original abstract
Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x), cutting experts needed for 99% coverage from infeasible to a small constant: temporal memory (beta), a per-expert LIF membrane potential accumulating routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, yielding 31x contrast between reliable and unreliable experts; and anticipatory routing, a next-state predictor conditioned on the beta-accumulated hidden state. The mechanisms draw from Friston's Free Energy Principle and use LIF dynamics from spiking neural networks. An ablation across all 2^3 subsets reveals a super-additive beta x Ant interaction: anticipation alone gives nothing (+0.000 +/- 0.001); beta alone gives modest gain (+0.295 +/- 0.013); combined they close 75% of the oracle gap (+0.741 +/- 0.002, exceeding the sum by +0.446 +/- 0.014). This is structural: a stateless predictor cannot detect approaching transitions because pre-transition tokens are distributionally identical to within-domain tokens. In a character-level MoE LM (5 seeds), beta-routing reduces transition-step BPC from 6.56 +/- 0.01 (Standard) to 4.01 +/- 0.15 (beta-MoE); the beta + Ant gate places 0.86 +/- 0.02 probability on the correct domain expert before that domain appears in input, vs 0.42 +/- 0.12 for Standard MoE. Reference implementations (~200 lines each): https://github.com/russellwmy/affinity-is-not-enough
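The precision term can be sanity-checked in isolation. In this toy sketch the error magnitudes are invented (they are not the paper's data): an expert whose recent prediction errors are tight earns a far larger inverse-variance weight than a noisy one, which is the mechanism behind the reported contrast between reliable and unreliable experts.

```python
import numpy as np

rng = np.random.default_rng(0)
reliable = rng.normal(0.0, 0.05, size=256)    # stable expert: small prediction errors
unreliable = rng.normal(0.0, 0.50, size=256)  # noisy expert: 10x larger error sigma

pi_reliable = 1.0 / np.var(reliable)          # precision = inverse variance
pi_unreliable = 1.0 / np.var(unreliable)
contrast = pi_reliable / pi_unreliable        # roughly (0.50/0.05)^2, i.e. ~100x here
print(f"{contrast:.0f}x")
```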
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard affinity-based routing in sparse MoE models fails at domain transitions (assigning only 0.006 probability to the correct expert), and that three lightweight gate modifications inspired by the Free Energy Principle—temporal memory via per-expert LIF membrane potential (beta), precision-weighted gating (Pi) based on inverse variance of prediction error, and anticipatory next-state prediction conditioned on beta states—raise this to 0.748 (124x improvement) in a 4-expert synthetic setup and yield better transition BPC and anticipatory probability in a character-level MoE LM. A full 2^3 ablation shows a super-additive beta x Ant interaction that closes 75% of the oracle gap.
Significance. If the empirical gains hold, the work offers a low-overhead, neuroscience-inspired route to more robust MoE routing under shifts, with the potential to reduce the number of experts needed for coverage. Strengths include the complete 2^3 ablation across 5 seeds, direct measurement of the super-additive delta (+0.446), consistent numerical improvements, and open reference implementations (~200 lines) that enable verification.
major comments (3)
- [Abstract and §1] The title and framing claim to 'recover' the Free Energy Principle, yet the mechanisms are presented as drawing inspiration from FEP without deriving the gating rules (beta, Pi, anticipatory) from FEP equations or showing formal equivalence; the FEP connection motivates the design but does not enter the reported computations.
- [§4] Synthetic transitions: the headline result (0.006 to 0.748) rests on abrupt switches between four fixed, fully specialized experts where pre-transition tokens are distributionally identical to in-domain tokens; this does not test gradual shifts, jointly trained partially specialized experts, or learned end-to-end routers that characterize large-scale deployed MoE, limiting the load-bearing claim that the modifications address realistic domain-shift failure modes.
- [§5] Character-level LM: while beta and beta+Ant improve transition BPC and anticipatory probability, the 'Standard' baseline is pure affinity routing; no comparison is provided against standard learned MoE routers (e.g., top-k with jointly optimized gates) or multi-domain benchmarks, so the practical advantage over existing methods is not quantified.
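For reference, the stateless baseline the referee describes is tiny. A minimal sketch of top-k affinity routing (the jointly trained variants differ mainly in how `w_gate` is optimized, not in this forward pass):

```python
import numpy as np

def topk_affinity_route(x, w_gate, k=2):
    """Stateless top-k affinity router: score experts by x @ w_gate,
    keep the k highest-scoring experts, renormalize their softmax weights."""
    logits = x @ w_gate
    top = np.argsort(logits)[-k:]                  # indices of the k best experts
    z = np.exp(logits[top] - logits[top].max())    # stable softmax over the kept logits
    return top, z / z.sum()

rng = np.random.default_rng(0)
experts, weights = topk_affinity_route(rng.standard_normal(16),
                                       rng.standard_normal((16, 4)))
```

Nothing here depends on previous tokens, which is why such a router cannot see a transition coming.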
minor comments (2)
- [§3] The LIF update rule for beta and the exact formula for Pi (inverse variance) are described only at a high level; adding the discrete-time equations to the main text (rather than deferring to GitHub) would improve clarity and allow readers to verify the 31x contrast claim without external code.
- [§4] Ablation tables/figures should explicitly state the statistical test and degrees of freedom for the super-additive interaction delta (+0.446 +/- 0.014) and confirm that all 8 conditions used the same 5 seeds.
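A plausible reconstruction of those discrete-time updates, assembled from the forms the paper quotes (the LIF accumulation and the precision-weighted error from its §3.2); treat this as a reader's reconstruction, not the manuscript's exact equations:

```latex
% LIF temporal memory: per-expert membrane potential integrating affinity a_i[t]
U_i[t] = \beta\, U_i[t-1] + a_i[t]

% Precision weighting: inverse variance of expert i's recent prediction error
\Pi_i = \sigma_i^{-2}, \qquad \xi_i[t] = \Pi_i \, \varepsilon_i[t]
```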
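The super-additivity claim itself is a one-line check against the abstract's own numbers (all gains reported relative to the affinity-only baseline):

```python
gain_ant = 0.000    # anticipation alone (+0.000)
gain_beta = 0.295   # beta alone (+0.295)
gain_both = 0.741   # beta + anticipation combined (+0.741)

# excess of the combined gain over the sum of individual gains
interaction = gain_both - (gain_beta + gain_ant)
print(round(interaction, 3))  # 0.446, the reported super-additive delta
```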
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments identify key areas where the manuscript's framing, experimental scope, and comparisons can be clarified or strengthened. We address each major comment below, indicating revisions where the manuscript will be updated to better reflect the work's contributions and limitations.
read point-by-point responses
-
Referee: [Abstract and §1] The title and framing claim to 'recover' the Free Energy Principle, yet the mechanisms are presented as drawing inspiration from FEP without deriving the gating rules (beta, Pi, anticipatory) from FEP equations or showing formal equivalence; the FEP connection motivates the design but does not enter the reported computations.
Authors: We agree that the mechanisms are motivated by and draw from principles in the Free Energy Principle (precision weighting, predictive coding, and temporal integration) but are not formally derived from FEP variational equations or shown to be mathematically equivalent. The title and abstract use 'recovering' to describe the functional recovery of effective routing behavior via these principles. We will revise the title, abstract, and §1 to state that the modifications are 'inspired by' the Free Energy Principle, explicitly note the inspirational rather than derivational nature of the connection, and remove any implication of formal equivalence. This is a textual clarification only. revision: yes
-
Referee: [§4] Synthetic transitions: the headline result (0.006 to 0.748) rests on abrupt switches between four fixed, fully specialized experts where pre-transition tokens are distributionally identical to in-domain tokens; this does not test gradual shifts, jointly trained partially specialized experts, or learned end-to-end routers that characterize large-scale deployed MoE, limiting the load-bearing claim that the modifications address realistic domain-shift failure modes.
Authors: The synthetic setup deliberately isolates the core failure mode of affinity routing: pre-transition tokens are distributionally indistinguishable from in-domain tokens, so no stateless router can succeed without temporal memory. This is why anticipation alone yields zero gain while the beta-Ant interaction is super-additive. We acknowledge that the experiment uses fixed, fully specialized experts and abrupt switches rather than gradual shifts or end-to-end jointly trained routers. In revision we will expand the discussion in §4 and the conclusion to explicitly state these scope limitations and outline how the mechanisms could be integrated into learned routers for gradual or partial-specialization settings. revision: partial
-
Referee: [§5] Character-level LM: while beta and beta+Ant improve transition BPC and anticipatory probability, the 'Standard' baseline is pure affinity routing; no comparison is provided against standard learned MoE routers (e.g., top-k with jointly optimized gates) or multi-domain benchmarks, so the practical advantage over existing methods is not quantified.
Authors: The 'Standard' baseline implements the affinity-based routing that is the default in many sparse MoE implementations; the reported gains are therefore relative to that common starting point. We agree that direct comparisons against jointly optimized top-k routers and on established multi-domain benchmarks would better quantify practical advantage. Because the current experiments prioritize controlled isolation of the proposed mechanisms, we will add a paragraph in §5 and the discussion noting this gap and, where feasible within revision time, include a limited comparison using the released reference implementations against a basic learned top-k router on the same character-level task. revision: yes
Circularity Check
No circularity: empirical results are measured directly; FEP supplies only design motivation.
full rationale
The paper reports concrete experimental outcomes (transition-step probabilities, BPC values, ablation deltas) computed from held-out synthetic 4-expert transitions and a character-level LM. The three gate modifications are implemented as explicit algorithms (LIF membrane, inverse-variance weighting, next-state predictor) and evaluated via 2^3 ablations whose numbers are obtained by direct measurement rather than by any equation that reduces to the inputs. The FEP reference appears only as inspirational framing for the choice of mechanisms; it does not enter the routing equations, the probability calculations, or the reported statistics. No self-citation load-bearing step, no fitted parameter renamed as prediction, and no uniqueness theorem invoked. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MoE experts become specialized to distinct data distributions in the training regime
Reference graph
Works this paper leans on
- [13] K. Friston, “The free-energy principle: A unified brain theory?” Nature Reviews Neuroscience, vol. 11, no. 2, pp. 127–138, 2010, doi: 10.1038/nrn2787
- [14] J. K. Eshraghian et al., “Training spiking neural networks using lessons from deep learning,” Proceedings of the IEEE, vol. 111, no. 9, pp. 1016–1054, 2023. Available: https://arxiv.org/abs/2109.12894
- [15] Z. Qiu et al., “Layerwise recurrent router for mixture-of-experts,” in Proceedings of ICLR 2025. Available: https://arxiv.org/abs/2408.06793
- [16] Y. Shen and J. Henderson, “Temporally extended mixture-of-experts,” arXiv preprint arXiv:2604.20156, 2026. Available: https://arxiv.org/abs/2604.20156
- [17] A. Liu et al., “DeepSeek-V3 technical report,” arXiv preprint arXiv:2412.19437, 2024. Available: https://arxiv.org/abs/2412.19437
- [18] X. He et al., “ExpertFlow: Efficient mixture-of-experts inference via predictive expert caching and token scheduling,” arXiv preprint arXiv:2410.17954, 2024. Available: https://arxiv.org/abs/2410.17954
- [19] N. Shazeer et al., “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in Proceedings of ICLR 2017. Available: https://arxiv.org/abs/1701.06538
- [20] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022. Available: https://www.jmlr.org/papers/v23/21-0998.html
- [21] A. Q. Jiang et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024. Available: https://arxiv.org/abs/2401.04088
- [22] Y. Zhou et al., “Mixture-of-experts with expert choice routing,” in Advances in Neural Information Processing Systems 35, 2022. Available: https://arxiv.org/abs/2202.09368
- [23] B. Zoph et al., “ST-MoE: Designing stable and transferable sparse expert models,” arXiv preprint arXiv:2202.08906, 2022. Available: https://arxiv.org/abs/2202.08906
- [24] A. M. O. Thiombiano et al., “MoxE: Mixture of xLSTM experts with entropy-aware routing,” arXiv preprint arXiv:2505.01459, 2025. Available: https://arxiv.org/abs/2505.01459
- [25] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in Proceedings of ACL 2019, pp. 2978–2988. Available: https://arxiv.org/abs/1901.02860
- [26] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” in Proceedings of ICLR 2022. Available: https://arxiv.org/abs/2111.00396
- [27] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023. Available: https://arxiv.org/abs/2312.00752
- [28] B. Peng et al., “RWKV: Reinventing RNNs for the transformer era,” in Findings of EMNLP 2023, pp. 14048–14073. Available: https://arxiv.org/abs/2305.13048
- [29] Y. Sun et al., “Retentive network: A successor to Transformer for large language models,” arXiv preprint arXiv:2307.08621, 2023. Available: https://arxiv.org/abs/2307.08621
- [30] R. P. N. Rao and D. H. Ballard, “Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects,” Nature Neuroscience, vol. 2, no. 1, pp. 79–87, 1999, doi: 10.1038/4580
- [31] J. C. R. Whittington and R. Bogacz, “An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity,” Neural Computation, vol. 29, no. 5, pp. 1229–1262, 2017, doi: 10.1162/NECO_a_00949
- [32] B. Millidge, A. Tschantz, and C. L. Buckley, “Predictive coding approximates backprop along arbitrary computation graphs,” Neural Computation, vol. 34, no. 6, pp. 1329–1368, 2022, doi: 10.1162/neco_a_01497
- [33] X. Li and M. Wicker, “Variational routing: A scalable Bayesian framework for calibrated mixture-of-experts transformers,” arXiv preprint arXiv:2603.09453, 2026. Available: https://arxiv.org/abs/2603.09453
- [34] “Graph mixture of experts and memory-augmented routers for multivariate time series anomaly detection,” arXiv preprint arXiv:2412.19108, 2024. Available: https://arxiv.org/abs/2412.19108
- [35] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, “Better & faster large language models via multi-token prediction,” in Proceedings of ICML 2024. Available: https://arxiv.org/abs/2404.19737
- [36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997, doi: 10.1162/neco.1997.9.8.1735
- [37] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of EMNLP 2014, pp. 1724–1734. Available: https://aclanthology.org/D14-1179/
- [38] K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, and G. Pezzulo, “Active inference: A process theory,” Neural Computation, vol. 29, no. 1, pp. 1–49, 2017, doi: 10.1162/NECO_a_00912
- [39] Z. Fountas, N. Sajid, P. A. M. Mediano, and K. Friston, “Deep active inference agents using Monte-Carlo methods,” in Advances in Neural Information Processing Systems 33, 2020, pp. 11662–11675. Available: https://arxiv.org/abs/2006.04176
- [40] O. Çatal, S. Wauthier, C. De Boom, T. Verbelen, and B. Dhoedt, “Learning generative state space models for active inference,” Frontiers in Computational Neuroscience, vol. 14, p. 574372, 2020, doi: 10.3389/fncom.2020.574372
- [41] L. Pinchetti, T. Salvatori, Y. Yordanov, B. Millidge, Y. Song, and T. Lukasiewicz, “Predictive coding beyond Gaussian distributions,” in Advances in Neural Information Processing Systems 35, 2022. Available: https://arxiv.org/abs/2211.03481
- [42] B. Millidge, T. Salvatori, Y. Song, R. Bogacz, and T. Lukasiewicz, “Predictive coding: Towards a future of deep learning beyond backpropagation?” arXiv preprint arXiv:2202.09467, 2022. Available: https://arxiv.org/abs/2202.09467
- [43] K. Friston et al., “Path integrals, particular kinds and strange things,” Physics of Life Reviews, vol. 47, pp. 35–62, 2023, doi: 10.1016/j.plrev.2023.08.016
- [44] E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks,” IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 51–63, 2019. Available: https://arxiv.org/abs/1901.09948
- [45] G. Bellec et al., “A solution to the learning dilemma for recurrent networks of spiking neurons,” Nature Communications, vol. 11, p. 3625, 2020, doi: 10.1038/s41467-020-17236-y
- [46] Z. Zhou et al., “Spikformer: When spiking neural network meets transformer,” in Proceedings of ICLR 2023. Available: https://arxiv.org/abs/2209.15425
- [47] R.-J. Zhu, Q. Zhao, G. Li, and J. K. Eshraghian, “SpikeGPT: Generative pre-trained language model with spiking neural networks,” arXiv preprint arXiv:2302.13939, 2023. Available: https://arxiv.org/abs/2302.13939