pith. machine review for the scientific record.

arxiv: 2605.00604 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.NE

Recognition: unknown

Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

Man Yung Wong (Russell)

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:06 UTC · model grok-4.3

classification 💻 cs.LG cs.NE
keywords mixture of experts · routing · domain transition · free energy principle · temporal memory · anticipatory routing · leaky integrate and fire · sparse models

The pith

Mixture-of-experts routing succeeds at domain transitions when the gate adds temporal memory, precision weighting, and anticipation, three mechanisms drawn from the Free Energy Principle.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard affinity-based routing in sparse MoE models assigns near-zero probability to the correct expert when the input distribution changes, because pre-transition tokens look identical to stable-domain tokens. The paper shows that three lightweight modifications to the gate—temporal memory via a per-expert leaky integrate-and-fire membrane potential, precision-weighted gating based on inverse variance of recent prediction errors, and an anticipatory next-state predictor—raise correct-expert probability at transitions from 0.006 to 0.748 in controlled four-expert tests. The same changes applied to a character-level MoE language model cut transition-step bits per character and let the model assign high probability to the right expert before its tokens appear. Ablations confirm the gains are super-additive: neither temporal memory nor anticipation works well alone.

Core claim

Affinity routing alone cannot solve expert selection across domain transitions because it lacks state to detect approaching changes. Adding a per-expert LIF membrane potential (beta) that accumulates routing context, precision-weighted gating (Pi) that amplifies reliable experts, and a next-state predictor conditioned on the accumulated state raises the probability on the correct expert at the transition point from 0.006 to 0.748 in synthetic experiments and from 0.42 to 0.86 in a character-level language model, while cutting transition bits per character from 6.56 to 4.01.

What carries the argument

The combination of a per-expert LIF membrane potential (beta) for temporal memory, precision-weighted gating (Pi) using inverse variance of prediction errors, and anticipatory routing that predicts the next hidden state from the accumulated context.
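
Since the gate changes are described as roughly 200 lines of reference code, the mechanism is compact enough to sketch. The snippet below is a minimal illustrative sketch, not the authors' released implementation: the class name `StatefulGate`, the `decay` constant, the running error-variance bookkeeping, and the way the anticipatory score is combined with the affinity logits are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StatefulGate(nn.Module):
    """Illustrative stateful MoE gate: affinity routing plus LIF temporal memory
    (beta), precision weighting (Pi), and an anticipatory next-state predictor.
    This is a sketch of the idea, not the paper's reference implementation."""

    def __init__(self, d_model: int, n_experts: int, decay: float = 0.9):
        super().__init__()
        self.affinity = nn.Linear(d_model, n_experts)    # standard affinity scores
        self.anticipator = nn.Linear(d_model, d_model)   # predicts the next hidden state
        self.decay = decay                               # leak factor (beta); assumed value
        # Per-expert LIF membrane potential accumulating routing context.
        self.register_buffer("membrane", torch.zeros(n_experts))
        # Beta-accumulated hidden state used to condition the anticipatory predictor.
        self.register_buffer("context", torch.zeros(d_model))
        # Running variance of per-expert prediction errors (precision = 1 / variance).
        self.register_buffer("err_var", torch.ones(n_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: hidden state of the current token, shape (d_model,)."""
        logits = self.affinity(x)

        # 1) Temporal memory: leaky accumulation of routing evidence across tokens.
        self.membrane = self.decay * self.membrane + logits.detach()
        self.context = self.decay * self.context + x.detach()

        # 2) Precision weighting: amplify experts with reliable (low-variance) errors.
        precision = 1.0 / (self.err_var + 1e-6)

        # 3) Anticipation: route against a predicted next hidden state, conditioned
        #    on the beta-accumulated context rather than the current token alone.
        anticipatory_logits = self.affinity(self.anticipator(self.context))

        combined = precision * (logits + self.membrane + anticipatory_logits)
        return F.softmax(combined, dim=-1)

    def update_errors(self, per_expert_error: torch.Tensor, momentum: float = 0.9) -> None:
        # Maintain the running error variance that defines each expert's precision.
        self.err_var = momentum * self.err_var + (1 - momentum) * per_expert_error.pow(2)
```

A usage sketch under the same assumptions: `gate = StatefulGate(d_model=64, n_experts=4)` followed by `probs = gate(torch.randn(64))` once per token; the quantity the paper tracks is the probability such a gate assigns to the correct expert at the token just before a domain switch.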

If this is right

  • MoE models require only a small fixed number of experts for reliable coverage across many domains once the gate carries temporal state.
  • Routing decisions become predictive, allowing expert activation before the new domain tokens arrive.
  • The super-additive interaction between temporal memory and anticipation shows that stateless predictors fail precisely because pre-transition tokens are distributionally identical to within-domain tokens.
  • The three modifications can be implemented in roughly 200 lines and require no change to the experts themselves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same stateful gate changes could improve robustness in other sparse architectures that route on affinity alone.
  • Applying the modifications at the scale of current large language models would test whether the gains persist when domain shifts arise from topic, style, or language changes rather than synthetic switches.
  • The approach embeds simple predictive-coding dynamics inside the router without requiring full variational inference over the entire model.

Load-bearing premise

The controlled four-expert synthetic transitions and the character-level language model setup capture the distribution-shift problems that occur in large-scale deployed MoE models.

What would settle it

Run the modified gate on a large-scale MoE language model trained on heterogeneous web text and measure whether the probability assigned to the correct domain expert at natural domain-switch points rises by a comparable factor.

read the original abstract

Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x), cutting experts needed for 99% coverage from infeasible to a small constant: temporal memory (beta), a per-expert LIF membrane potential accumulating routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, yielding 31x contrast between reliable and unreliable experts; and anticipatory routing, a next-state predictor conditioned on the beta-accumulated hidden state. The mechanisms draw from Friston's Free Energy Principle and use LIF dynamics from spiking neural networks. An ablation across all 2^3 subsets reveals a super-additive beta x Ant interaction: anticipation alone gives nothing (+0.000 +/- 0.001); beta alone gives modest gain (+0.295 +/- 0.013); combined they close 75% of the oracle gap (+0.741 +/- 0.002, exceeding the sum by +0.446 +/- 0.014). This is structural: a stateless predictor cannot detect approaching transitions because pre-transition tokens are distributionally identical to within-domain tokens. In a character-level MoE LM (5 seeds), beta-routing reduces transition-step BPC from 6.56 +/- 0.01 (Standard) to 4.01 +/- 0.15 (beta-MoE); the beta + Ant gate places 0.86 +/- 0.02 probability on the correct domain expert before that domain appears in input, vs 0.42 +/- 0.12 for Standard MoE. Reference implementations (~200 lines each): https://github.com/russellwmy/affinity-is-not-enough
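
Read together with the gate fragments quoted in the reference graph below, the abstract implies discrete-time updates of roughly the following shape. This is a hedged reconstruction for orientation, not equations copied from the paper; the update rule for the log-precision gamma_i is truncated in the source and is therefore omitted.

```latex
% Hedged reconstruction of the gate dynamics implied by the abstract and the
% quoted fragments: U is the LIF membrane state, beta its leak factor, Pi_i the
% per-expert precision, and epsilon_i the recent prediction error of expert i.
\begin{align}
  U[t]  &= \beta\, U[t-1] + x[t]
        && \text{leaky integrate-and-fire accumulation (temporal memory)} \\
  \Pi_i &= \sigma_i^{-2} = e^{\gamma_i}
        && \text{precision as inverse variance of expert $i$'s prediction errors} \\
  \xi_i &= \Pi_i\, \epsilon_i
        && \text{precision-weighted prediction error passed to the gate}
\end{align}
```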

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that standard affinity-based routing in sparse MoE models fails at domain transitions (assigning only 0.006 probability to the correct expert), and that three lightweight gate modifications inspired by the Free Energy Principle—temporal memory via per-expert LIF membrane potential (beta), precision-weighted gating (Pi) based on inverse variance of prediction error, and anticipatory next-state prediction conditioned on beta states—raise this to 0.748 (124x improvement) in a 4-expert synthetic setup and yield better transition BPC and anticipatory probability in a character-level MoE LM. A full 2^3 ablation shows a super-additive beta x Ant interaction that closes 75% of the oracle gap.

Significance. If the empirical gains hold, the work offers a low-overhead, neuroscience-inspired route to more robust MoE routing under shifts, with the potential to reduce the number of experts needed for coverage. Strengths include the complete 2^3 ablation across 5 seeds, direct measurement of the super-additive delta (+0.446), consistent numerical improvements, and open reference implementations (~200 lines) that enable verification.

major comments (3)
  1. [Abstract and §1] The title and framing claim to 'recover' the Free Energy Principle, yet the mechanisms are presented as drawing inspiration from FEP without deriving the gating rules (beta, Pi, anticipatory) from FEP equations or showing formal equivalence; the FEP connection motivates the design but does not enter the reported computations.
  2. [§4] Synthetic transitions: the headline result (0.006 to 0.748) rests on abrupt switches between four fixed, fully specialized experts where pre-transition tokens are distributionally identical to in-domain tokens; this does not test gradual shifts, jointly trained partially specialized experts, or learned end-to-end routers that characterize large-scale deployed MoE, limiting the load-bearing claim that the modifications address realistic domain-shift failure modes.
  3. [§5] Character-level LM: while beta and beta+Ant improve transition BPC and anticipatory probability, the 'Standard' baseline is pure affinity routing; no comparison is provided against standard learned MoE routers (e.g., top-k with jointly optimized gates) or multi-domain benchmarks, so the practical advantage over existing methods is not quantified.
minor comments (2)
  1. [§3] The LIF update rule for beta and the exact formula for Pi (inverse variance) are described only at a high level; adding the discrete-time equations to the main text (rather than deferring to GitHub) would improve clarity and allow readers to verify the 31x contrast claim without external code.
  2. [§4] Ablation tables/figures should explicitly state the statistical test and degrees of freedom for the super-additive interaction delta (+0.446 +/- 0.014) and confirm that all 8 conditions used the same 5 seeds.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments identify key areas where the manuscript's framing, experimental scope, and comparisons can be clarified or strengthened. We address each major comment below, indicating revisions where the manuscript will be updated to better reflect the work's contributions and limitations.

read point-by-point responses
  1. Referee: [Abstract and §1] The title and framing claim to 'recover' the Free Energy Principle, yet the mechanisms are presented as drawing inspiration from FEP without deriving the gating rules (beta, Pi, anticipatory) from FEP equations or showing formal equivalence; the FEP connection motivates the design but does not enter the reported computations.

    Authors: We agree that the mechanisms are motivated by and draw from principles in the Free Energy Principle (precision weighting, predictive coding, and temporal integration) but are not formally derived from FEP variational equations or shown to be mathematically equivalent. The title and abstract use 'recovering' to describe the functional recovery of effective routing behavior via these principles. We will revise the title, abstract, and §1 to state that the modifications are 'inspired by' the Free Energy Principle, explicitly note the inspirational rather than derivational nature of the connection, and remove any implication of formal equivalence. This is a textual clarification only. revision: yes

  2. Referee: [§4] Synthetic transitions: the headline result (0.006 to 0.748) rests on abrupt switches between four fixed, fully specialized experts where pre-transition tokens are distributionally identical to in-domain tokens; this does not test gradual shifts, jointly trained partially specialized experts, or learned end-to-end routers that characterize large-scale deployed MoE, limiting the load-bearing claim that the modifications address realistic domain-shift failure modes.

    Authors: The synthetic setup deliberately isolates the core failure mode of affinity routing: pre-transition tokens are distributionally indistinguishable from in-domain tokens, so no stateless router can succeed without temporal memory. This is why anticipation alone yields zero gain while the beta-Ant interaction is super-additive. We acknowledge that the experiment uses fixed, fully specialized experts and abrupt switches rather than gradual shifts or end-to-end jointly trained routers. In revision we will expand the discussion in §4 and the conclusion to explicitly state these scope limitations and outline how the mechanisms could be integrated into learned routers for gradual or partial-specialization settings. revision: partial

  3. Referee: [§5] Character-level LM: while beta and beta+Ant improve transition BPC and anticipatory probability, the 'Standard' baseline is pure affinity routing; no comparison is provided against standard learned MoE routers (e.g., top-k with jointly optimized gates) or multi-domain benchmarks, so the practical advantage over existing methods is not quantified.

    Authors: The 'Standard' baseline implements the affinity-based routing that is the default in many sparse MoE implementations; the reported gains are therefore relative to that common starting point. We agree that direct comparisons against jointly optimized top-k routers and on established multi-domain benchmarks would better quantify practical advantage. Because the current experiments prioritize controlled isolation of the proposed mechanisms, we will add a paragraph in §5 and the discussion noting this gap and, where feasible within revision time, include a limited comparison using the released reference implementations against a basic learned top-k router on the same character-level task. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results are measured directly; FEP supplies only design motivation.

full rationale

The paper reports concrete experimental outcomes (transition-step probabilities, BPC values, ablation deltas) computed from held-out synthetic 4-expert transitions and a character-level LM. The three gate modifications are implemented as explicit algorithms (LIF membrane, inverse-variance weighting, next-state predictor) and evaluated via 2^3 ablations whose numbers are obtained by direct measurement rather than by any equation that reduces to the inputs. The FEP reference appears only as inspirational framing for the choice of mechanisms; it does not enter the routing equations, the probability calculations, or the reported statistics. There is no load-bearing self-citation, no fitted parameter renamed as a prediction, and no uniqueness theorem invoked. The evidential chain is therefore grounded in direct measurement rather than in circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters beyond ordinary MoE training hyperparameters; the beta, Pi, and anticipation modules are algorithmic additions whose internal constants are part of the proposed method rather than fitted to the target metric.

axioms (1)
  • domain assumption: MoE experts become specialized to distinct data distributions in the training regime
    Invoked when the controlled experiment defines 'correct expert' at each domain transition.

pith-pipeline@v0.9.0 · 5667 in / 1346 out tokens · 46664 ms · 2026-05-09T19:06:45.161468+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 34 canonical work pages · 9 internal anchors

  1. [1]

    Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts

    01 ± 0.15 (β-MoE); the β+Ant gate places 0.86 ± 0.02 probability on the correct domain expert before that domain becomes visible in the input, versus 0.42 ± 0.12 for Standard MoE. Reference implementations are ∼200 lines each and released as prototype/ alongside the paper. 1 Introduction: A router that knows only the current token is a navigator wit...

  2. [2]

    We propose three mechanisms — temporal memory (β), precision-weighted gating (Π), and anticipatory routing — that raise the probability mass on the correct expert at domain-transition steps from 0.006 ± 0.001 to 0.748 ± 0.002 in controlled experiments (5 seeds, 124× increase), reducing the experts required for 99% routing coverage from infeasibly l...

  3. [3]

    We identify a super-additive β×Ant interaction: anticipation alone provides no transition-step gain (+0.000 ± 0.001), β alone provides modest gain (+0.295 ± 0.013), but combined they close 75% of the oracle gap (+0.741 ± 0.002, exceeding the sum of individual gains by +0.446 ± 0.014). This is a structural finding: a stateless predictor cannot de...

  4. [4]

    We validate the mechanisms in a character-level MoE language model (5 seeds), where adding β alone reduces transition-step BPC from 6.56 ± 0.01 (Standard MoE) to 4.01 ± 0.15 (β-MoE); adding the anticipatory predictor on top of β raises routing-correct probability at the transition step from 0.60 ± 0.22 to 0.86 ± 0.02, stabilising the routing decisi...

  5. [5]

    We connect MoE routing to Friston’s Free Energy Principle, showing that each mechanism has a principled motivation in the FEP equations (recursive state estimation, precision update, expected free energy minimization) and is naturally instantiated using LIF dynamics from spiking neural networks (Sections 2, 3)

  6. [6]

    Conditioning the predictor on the β-accumulated hidden state restores its routing utility (Section 5.3)

    We provide a structural explanation for why DeepSeek’s MTP module is discarded at inference: a stateless next-token predictor cannot detect approaching transitions, regardless of training budget. Conditioning the predictor on the β-accumulated hidden state restores its routing utility (Section 5.3). Reference implementations of each mechanism (5–10 lines ...

  7. [7]

    wrong input-to-expert affinity

    013 vs. 0.006 ± 0.001). 3.2 Precision-Weighted Gating (Π). Theoretical basis: In the FEP hierarchy, prediction errors are precision-weighted before being passed upward ([1], Box 2): ξ_i = Π_i · ε_i, where Π_i = σ_i^{−2} is the precision (inverse variance) of expert i's prediction error. The gain update for the precision parameter γ_i (where Π_i = e^{γ_i}) is: Δγ_i ∝ ...

  8. [9]

    RWKV [16] and RetNet

    — reformulate sequence processing as convolution over learned decay kernels, directly implementing the U[t] = βU[t−1] + x[t] structure at the representation level. RWKV [16] and RetNet

  9. [10]

    low affinity

    derive recurrent formulations of attention that trade expressiveness for linear-time inference. All of these apply recurrence to the representation — the token embedding that feeds into attention and FFN layers. Our β mechanism applies recurrence to the routing gate alone: the token representation x_t is unchanged; only the routing decision carries tempora...

  10. [11]

    DeepSeek-V3 uses MTP with D = 1 additional prediction head, finding improvements in pass@k coding benchmarks

    showed that training language models to predict multiple future tokens simultaneously (multi-token prediction, MTP) improves both training efficiency and downstream task performance. DeepSeek-V3 uses MTP with D = 1 additional prediction head, finding improvements in pass@k coding benchmarks. Both papers discard MTP at inference. Our analysis provides a prin...

  11. [12]

    Our paper is narrower: we make a specific, falsifiable claim about three mechanisms in MoE routing, motivate each from the FEP equations, and validate it empirically

    and related work have proposed that the FEP may provide a unifying framework for cognitive architectures more broadly. Our paper is narrower: we make a specific, falsifiable claim about three mechanisms in MoE routing, motivate each from the FEP equations, and validate it empirically. We do not claim our implementation computes variational free energy in th...

  12. [13]

    The free-energy principle: a unified brain theory? Nature Reviews Neuroscience 11, 127–138 (2010). https://doi.org/10.1038/nrn2787

    K. Friston, “The free-energy principle: A unified brain theory?” Nature Reviews Neuroscience, vol. 11, no. 2, pp. 127–138, 2010, doi: 10.1038/nrn2787

  13. [14]

    Training spiking neural networks using lessons from deep learning,

    J. K. Eshraghian et al., “Training spiking neural networks using lessons from deep learning,” Proceedings of the IEEE, vol. 111, no. 9, pp. 1016–1054, 2023. Available: https://arxiv.org/abs/2109.12894

  14. [15]

    Layerwise recurrent router for mixture-of-experts,

    Z. Qiu et al., “Layerwise recurrent router for mixture-of-experts,” in Proceedings of ICLR 2025, 2024. Available: https://arxiv.org/abs/2408.06793

  15. [16]

    Temporally Extended Mixture-of-Experts Models

    Y. Shen and J. Henderson, “Temporally extended mixture-of-experts,” arXiv preprint arXiv:2604.20156, 2026. Available: https://arxiv.org/abs/2604.20156

  16. [17]

    DeepSeek-V3 Technical Report

    A. Liu et al., “DeepSeek-V3 technical report,” arXiv preprint arXiv:2412.19437, 2024. Available: https://arxiv.org/abs/2412.19437

  17. [18]

    ExpertFlow: Efficient mixture-of-experts inference via predictive expert caching and token scheduling,

    X. He et al., “ExpertFlow: Efficient mixture-of-experts inference via predictive expert caching and token scheduling,” arXiv preprint arXiv:2410.17954, 2024. Available: https://arxiv.org/abs/2410.17954

  18. [19]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    N. Shazeer et al., “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in Proceedings of ICLR 2017, 2017. Available: https://arxiv.org/abs/1701.06538

  19. [20]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,

    W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022. Available: https://www.jmlr.org/papers/v23/21-0998.html

  20. [21]

    Mixtral of Experts

    A. Q. Jiang et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024. Available: https://arxiv.org/abs/2401.04088

  21. [22]

    Mixture-of-experts with expert choice routing

    Y. Zhou et al., “Mixture-of-experts with expert choice routing,” in Advances in neural information processing systems 35, 2022. Available: https://arxiv.org/abs/2202.09368

  22. [23]

    ST-MoE: Designing Stable and Transferable Sparse Expert Models

    B. Zoph et al., “ST-MoE: Designing stable and transferable sparse expert models,” arXiv preprint arXiv:2202.08906, 2022. Available: https://arxiv.org/abs/2202.08906

  23. [24]

    MoxE: Mixture of xLSTM experts with entropy-aware routing,

    A. M. O. Thiombiano et al., “MoxE: Mixture of xLSTM experts with entropy-aware routing,” arXiv preprint arXiv:2505.01459, 2025. Available: https://arxiv.org/abs/2505.01459

  24. [25]

    Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in Proceedings of ACL 2019, 2019, pp. 2978–2988. Available: https://arxiv.org/abs/1901.02860

  25. [26]

    Efficiently modeling long sequences with structured state spaces,

    A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” in Proceedings of ICLR 2022, 2022. Available: https://arxiv.org/abs/2111.00396

  26. [27]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023. Available: https://arxiv.org/abs/2312.00752

  27. [28]

    RWKV: Reinventing RNNs for the Transformer Era

    B. Peng et al., “RWKV: Reinventing RNNs for the transformer era,” in Findings of EMNLP 2023, 2023, pp. 14048–14073. Available: https://arxiv.org/abs/2305.13048

  28. [29]

    Retentive Network: A Successor to Transformer for Large Language Models

    Y. Sun et al., “Retentive network: A successor to transformer for large language models,” arXiv preprint arXiv:2307.08621, 2023. Available: https://arxiv.org/abs/2307.08621

  29. [30]

    Nature Neuroscience 2(1), 79–87 (1999)

    R. P. N. Rao and D. H. Ballard, “Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects,” Nature Neuroscience, vol. 2, no. 1, pp. 79–87, 1999, doi: 10.1038/4580

  30. [31]

    An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity,

    J. C. R. Whittington and R. Bogacz, “An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity,” Neural Computation, vol. 29, no. 5, pp. 1229–1262, 2017, doi: 10.1162/NECO_a_00949

  31. [32]

    Predictive coding approximates backprop along arbitrary computation graphs,

    B. Millidge, A. Tschantz, and C. L. Buckley, “Predictive coding approximates backprop along arbitrary computation graphs,” Neural Computation , vol. 34, no. 6, pp. 1329–1368, 2022, doi: 10.1162/neco_a_01497

  32. [33]

    Variational routing: A scalable Bayesian framework for calibrated mixture-of-experts transformers,

    X. Li and M. Wicker, “Variational routing: A scalable Bayesian framework for calibrated mixture-of-experts transformers,” arXiv preprint arXiv:2603.09453, 2026. Available: https://arxiv.org/abs/2603.09453

  33. [34]

    Graph mixture of experts and memory-augmented routers for multivariate time series anomaly detection,

    “Graph mixture of experts and memory-augmented routers for multivariate time series anomaly detection,” arXiv preprint arXiv:2412.19108, 2024. Available: https://arxiv.org/abs/2412.19108

  34. [35]

    Better & faster large language models via multi-token prediction

    F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, “Better & faster large language models via multi-token prediction,” in Proceedings of ICML 2024, 2024. Available: https://arxiv.org/abs/2404.19737

  35. [36]

    Long short-term memory

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997, doi: 10.1162/neco.1997.9.8.1735

  36. [37]

    Learning phrase representations using RNN encoder-decoder for statistical machine translation,

    K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of EMNLP 2014, 2014, pp. 1724–1734. Available: https://aclanthology.org/D14-1179/

  37. [38]

    Active inference: A process theory. Neural Computation 29(1), 1–49 (2017). https://doi.org/10.1162/NECO_a_00912

    K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, and G. Pezzulo, “Active inference: A process theory,” Neural Computation, vol. 29, no. 1, pp. 1–49, 2017, doi: 10.1162/NECO_a_00912

  38. [39]

    Deep active inference agents using Monte-Carlo methods,

    Z. Fountas, N. Sajid, P. A. M. Mediano, and K. Friston, “Deep active inference agents using Monte-Carlo methods,” in Advances in neural information processing systems 33, 2020, pp. 11662–11675. Available: https://arxiv.org/abs/2006.04176

  39. [40]

    Learning generative state space models for active inference,

    O. Çatal, S. Wauthier, C. De Boom, T. Verbelen, and B. Dhoedt, “Learning generative state space models for active inference,” Frontiers in Computational Neuroscience, vol. 14, p. 574372, 2020, doi: 10.3389/fncom.2020.574372

  40. [41]

    Predictive coding beyond Gaussian distributions,

    L. Pinchetti, T. Salvatori, Y. Yordanov, B. Millidge, Y. Song, and T. Lukasiewicz, “Predictive coding beyond Gaussian distributions,” in Advances in neural information processing systems 35, 2022. Available: https://arxiv.org/abs/2211.03481

  41. [42]

    Predictive coding: Towards a future of deep learning beyond backpropagation?

    B. Millidge, T. Salvatori, Y. Song, R. Bogacz, and T. Lukasiewicz, “Predictive coding: Towards a future of deep learning beyond backpropagation?” arXiv preprint arXiv:2202.09467, 2022. Available: https://arxiv.org/abs/2202.09467

  42. [43]

    Path integrals, particular kinds and strange things,

    K. Friston et al., “Path integrals, particular kinds and strange things,” Physics of Life Reviews, vol. 47, pp. 35–62, 2023, doi: 10.1016/j.plrev.2023.08.016

  43. [44]

    Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks,

    E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks,” IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 51–63, 2019. Available: https://arxiv.org/abs/1901.09948

  44. [45]

    A solution to the learning dilemma for recurrent networks of spiking neurons,

    G. Bellec et al., “A solution to the learning dilemma for recurrent networks of spiking neurons,” Nature Communications, vol. 11, p. 3625, 2020, doi: 10.1038/s41467-020-17236-y

  45. [46]

    Spikformer: When spiking neural network meets transformer

    Z. Zhou et al., “Spikformer: When spiking neural network meets transformer,” in Proceedings of ICLR 2023, 2022. Available: https://arxiv.org/abs/2209.15425

  46. [47]

    SpikeGPT: Generative pre-trained language model with spiking neural networks,

    R.-J. Zhu, Q. Zhao, G. Li, and J. K. Eshraghian, “SpikeGPT: Generative pre-trained language model with spiking neural networks,” arXiv preprint arXiv:2302.13939, 2023. Available: https://arxiv.org/abs/2302.13939