Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
Pith reviewed 2026-05-09 19:06 UTC · model grok-4.3
The pith
Mixture-of-experts routing succeeds at domain transitions when the gate adds temporal memory, precision weighting, and anticipation, three mechanisms inspired by the Free Energy Principle.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Affinity routing alone cannot solve expert selection across domain transitions because it lacks the state needed to detect approaching changes. Three gate additions supply it: a per-expert LIF membrane potential (beta) that accumulates routing context, precision-weighted gating (Pi) that amplifies reliable experts, and a next-state predictor conditioned on the accumulated state. Together they raise the probability assigned to the correct expert at the transition point from 0.006 to 0.748 in synthetic experiments and from 0.42 to 0.86 in a character-level language model, while cutting transition-step bits per character from 6.56 to 4.01.
What carries the argument
The combination of a per-expert LIF membrane potential (beta) for temporal memory, precision-weighted gating (Pi) using inverse variance of prediction errors, and anticipatory routing that predicts the next hidden state from the accumulated context.
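As a concrete sketch, the three mechanisms compose into one small stateful gate. The class below is illustrative only — the names, shapes, initialization, and the exact way the anticipatory logits combine with the membrane potential are assumptions, not the paper's released reference implementation.

```python
import numpy as np

class StatefulGate:
    """Hypothetical sketch of the three gate modifications: LIF temporal
    memory (beta), precision weighting (Pi), and anticipatory routing."""

    def __init__(self, n_experts, d_model, beta=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.beta = beta                              # LIF decay factor
        self.u = np.zeros(n_experts)                  # per-expert membrane potential
        self.h = np.zeros(d_model)                    # beta-accumulated hidden state
        self.err = [[1.0] for _ in range(n_experts)]  # recent per-expert prediction errors
        self.w_gate = rng.standard_normal((d_model, n_experts)) * 0.1
        self.w_pred = rng.standard_normal((d_model, d_model)) * 0.1  # next-state predictor

    def step(self, x):
        affinity = x @ self.w_gate                    # standard stateless affinity logits
        self.u = self.beta * self.u + affinity        # temporal memory (beta)
        self.h = self.beta * self.h + x               # accumulated context for anticipation
        ant = np.tanh(self.h @ self.w_pred) @ self.w_gate  # score the predicted next state
        pi = np.array([1.0 / (np.var(e) + 1e-6) for e in self.err])
        pi = pi / pi.mean()                           # precision weighting (Pi), normalized
        logits = pi * (self.u + ant)
        z = np.exp(logits - logits.max())
        return z / z.sum()                            # routing distribution over experts

    def record_error(self, expert, err, window=32):
        self.err[expert] = (self.err[expert] + [err])[-window:]
```

Because `u` and `h` carry across calls, two identical inputs can route differently, which is exactly the property a stateless affinity gate lacks.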
If this is right
- MoE models require only a small fixed number of experts for reliable coverage across many domains once the gate carries temporal state.
- Routing decisions become predictive, allowing expert activation before the new domain tokens arrive.
- The super-additive interaction between temporal memory and anticipation shows that stateless predictors fail precisely because pre-transition tokens are distributionally identical to within-domain tokens.
- The three modifications can be implemented in roughly 200 lines and require no change to the experts themselves.
Where Pith is reading between the lines
- The same stateful gate changes could improve robustness in other sparse architectures that route on affinity alone.
- Applying the modifications at the scale of current large language models would test whether the gains persist when domain shifts arise from topic, style, or language changes rather than synthetic switches.
- The approach embeds simple predictive-coding dynamics inside the router without requiring full variational inference over the entire model.
Load-bearing premise
The controlled four-expert synthetic transitions and the character-level language model setup capture the distribution-shift problems that occur in large-scale deployed MoE models.
What would settle it
Run the modified gate on a large-scale MoE language model trained on heterogeneous web text and measure whether the probability assigned to the correct domain expert at natural domain-switch points rises by a comparable factor.
read the original abstract
Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x), cutting experts needed for 99% coverage from infeasible to a small constant: temporal memory (beta), a per-expert LIF membrane potential accumulating routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, yielding 31x contrast between reliable and unreliable experts; and anticipatory routing, a next-state predictor conditioned on the beta-accumulated hidden state. The mechanisms draw from Friston's Free Energy Principle and use LIF dynamics from spiking neural networks. An ablation across all 2^3 subsets reveals a super-additive beta x Ant interaction: anticipation alone gives nothing (+0.000 +/- 0.001); beta alone gives modest gain (+0.295 +/- 0.013); combined they close 75% of the oracle gap (+0.741 +/- 0.002, exceeding the sum by +0.446 +/- 0.014). This is structural: a stateless predictor cannot detect approaching transitions because pre-transition tokens are distributionally identical to within-domain tokens. In a character-level MoE LM (5 seeds), beta-routing reduces transition-step BPC from 6.56 +/- 0.01 (Standard) to 4.01 +/- 0.15 (beta-MoE); the beta + Ant gate places 0.86 +/- 0.02 probability on the correct domain expert before that domain appears in input, vs 0.42 +/- 0.12 for Standard MoE. Reference implementations (~200 lines each): https://github.com/russellwmy/affinity-is-not-enough
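The precision term can be sanity-checked in isolation. In this toy sketch the error magnitudes are invented (they are not the paper's data): an expert whose recent prediction errors are tight earns a far larger inverse-variance weight than a noisy one, which is the mechanism behind the reported contrast between reliable and unreliable experts.

```python
import numpy as np

rng = np.random.default_rng(0)
reliable = rng.normal(0.0, 0.05, size=256)    # stable expert: small prediction errors
unreliable = rng.normal(0.0, 0.50, size=256)  # noisy expert: 10x larger error sigma

pi_reliable = 1.0 / np.var(reliable)          # precision = inverse variance
pi_unreliable = 1.0 / np.var(unreliable)
contrast = pi_reliable / pi_unreliable        # roughly (0.50/0.05)^2, i.e. ~100x here
print(f"{contrast:.0f}x")
```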
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that standard affinity-based routing in sparse MoE models fails at domain transitions (assigning only 0.006 probability to the correct expert), and that three lightweight gate modifications inspired by the Free Energy Principle—temporal memory via per-expert LIF membrane potential (beta), precision-weighted gating (Pi) based on inverse variance of prediction error, and anticipatory next-state prediction conditioned on beta states—raise this to 0.748 (124x improvement) in a 4-expert synthetic setup and yield better transition BPC and anticipatory probability in a character-level MoE LM. A full 2^3 ablation shows a super-additive beta x Ant interaction that closes 75% of the oracle gap.
Significance. If the empirical gains hold, the work offers a low-overhead, neuroscience-inspired route to more robust MoE routing under shifts, with the potential to reduce the number of experts needed for coverage. Strengths include the complete 2^3 ablation across 5 seeds, direct measurement of the super-additive delta (+0.446), consistent numerical improvements, and open reference implementations (~200 lines) that enable verification.
major comments (3)
- [Abstract and §1] The title and framing claim to 'recover' the Free Energy Principle, yet the mechanisms are presented as drawing inspiration from FEP without deriving the gating rules (beta, Pi, anticipatory) from FEP equations or showing formal equivalence; the FEP connection motivates the design but does not enter the reported computations.
- [§4] Synthetic transitions: the headline result (0.006 to 0.748) rests on abrupt switches between four fixed, fully specialized experts where pre-transition tokens are distributionally identical to in-domain tokens; this does not test gradual shifts, jointly trained partially specialized experts, or learned end-to-end routers that characterize large-scale deployed MoE, limiting the load-bearing claim that the modifications address realistic domain-shift failure modes.
- [§5] Character-level LM: while beta and beta+Ant improve transition BPC and anticipatory probability, the 'Standard' baseline is pure affinity routing; no comparison is provided against standard learned MoE routers (e.g., top-k with jointly optimized gates) or multi-domain benchmarks, so the practical advantage over existing methods is not quantified.
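For reference, the stateless baseline the referee describes is tiny. A minimal sketch of top-k affinity routing (the jointly trained variants differ mainly in how `w_gate` is optimized, not in this forward pass):

```python
import numpy as np

def topk_affinity_route(x, w_gate, k=2):
    """Stateless top-k affinity router: score experts by x @ w_gate,
    keep the k highest-scoring experts, renormalize their softmax weights."""
    logits = x @ w_gate
    top = np.argsort(logits)[-k:]                  # indices of the k best experts
    z = np.exp(logits[top] - logits[top].max())    # stable softmax over the kept logits
    return top, z / z.sum()

rng = np.random.default_rng(0)
experts, weights = topk_affinity_route(rng.standard_normal(16),
                                       rng.standard_normal((16, 4)))
```

Nothing here depends on previous tokens, which is why such a router cannot see a transition coming.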
minor comments (2)
- [§3] The LIF update rule for beta and the exact formula for Pi (inverse variance) are described only at a high level; adding the discrete-time equations to the main text (rather than deferring to GitHub) would improve clarity and allow readers to verify the 31x contrast claim without external code.
- [§4] Ablation tables/figures should explicitly state the statistical test and degrees of freedom for the super-additive interaction delta (+0.446 +/- 0.014) and confirm that all 8 conditions used the same 5 seeds.
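A plausible reconstruction of those discrete-time updates, assembled from the forms the paper quotes (the LIF accumulation and the precision-weighted error from its §3.2); treat this as a reader's reconstruction, not the manuscript's exact equations:

```latex
% LIF temporal memory: per-expert membrane potential integrating affinity a_i[t]
U_i[t] = \beta\, U_i[t-1] + a_i[t]

% Precision weighting: inverse variance of expert i's recent prediction error
\Pi_i = \sigma_i^{-2}, \qquad \xi_i[t] = \Pi_i \, \varepsilon_i[t]
```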
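The super-additivity claim itself is a one-line check against the abstract's own numbers (all gains reported relative to the affinity-only baseline):

```python
gain_ant = 0.000    # anticipation alone (+0.000)
gain_beta = 0.295   # beta alone (+0.295)
gain_both = 0.741   # beta + anticipation combined (+0.741)

# excess of the combined gain over the sum of individual gains
interaction = gain_both - (gain_beta + gain_ant)
print(round(interaction, 3))  # 0.446, the reported super-additive delta
```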
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments identify key areas where the manuscript's framing, experimental scope, and comparisons can be clarified or strengthened. We address each major comment below, indicating revisions where the manuscript will be updated to better reflect the work's contributions and limitations.
read point-by-point responses
-
Referee: [Abstract and §1] The title and framing claim to 'recover' the Free Energy Principle, yet the mechanisms are presented as drawing inspiration from FEP without deriving the gating rules (beta, Pi, anticipatory) from FEP equations or showing formal equivalence; the FEP connection motivates the design but does not enter the reported computations.
Authors: We agree that the mechanisms are motivated by and draw from principles in the Free Energy Principle (precision weighting, predictive coding, and temporal integration) but are not formally derived from FEP variational equations or shown to be mathematically equivalent. The title and abstract use 'recovering' to describe the functional recovery of effective routing behavior via these principles. We will revise the title, abstract, and §1 to state that the modifications are 'inspired by' the Free Energy Principle, explicitly note the inspirational rather than derivational nature of the connection, and remove any implication of formal equivalence. This is a textual clarification only. revision: yes
-
Referee: [§4] Synthetic transitions: the headline result (0.006 to 0.748) rests on abrupt switches between four fixed, fully specialized experts where pre-transition tokens are distributionally identical to in-domain tokens; this does not test gradual shifts, jointly trained partially specialized experts, or learned end-to-end routers that characterize large-scale deployed MoE, limiting the load-bearing claim that the modifications address realistic domain-shift failure modes.
Authors: The synthetic setup deliberately isolates the core failure mode of affinity routing: pre-transition tokens are distributionally indistinguishable from in-domain tokens, so no stateless router can succeed without temporal memory. This is why anticipation alone yields zero gain while the beta-Ant interaction is super-additive. We acknowledge that the experiment uses fixed, fully specialized experts and abrupt switches rather than gradual shifts or end-to-end jointly trained routers. In revision we will expand the discussion in §4 and the conclusion to explicitly state these scope limitations and outline how the mechanisms could be integrated into learned routers for gradual or partial-specialization settings. revision: partial
-
Referee: [§5] Character-level LM: while beta and beta+Ant improve transition BPC and anticipatory probability, the 'Standard' baseline is pure affinity routing; no comparison is provided against standard learned MoE routers (e.g., top-k with jointly optimized gates) or multi-domain benchmarks, so the practical advantage over existing methods is not quantified.
Authors: The 'Standard' baseline implements the affinity-based routing that is the default in many sparse MoE implementations; the reported gains are therefore relative to that common starting point. We agree that direct comparisons against jointly optimized top-k routers and on established multi-domain benchmarks would better quantify practical advantage. Because the current experiments prioritize controlled isolation of the proposed mechanisms, we will add a paragraph in §5 and the discussion noting this gap and, where feasible within revision time, include a limited comparison using the released reference implementations against a basic learned top-k router on the same character-level task. revision: yes
Circularity Check
No circularity: empirical results are measured directly; FEP supplies only design motivation.
full rationale
The paper reports concrete experimental outcomes (transition-step probabilities, BPC values, ablation deltas) computed from held-out synthetic 4-expert transitions and a character-level LM. The three gate modifications are implemented as explicit algorithms (LIF membrane, inverse-variance weighting, next-state predictor) and evaluated via 2^3 ablations whose numbers are obtained by direct measurement rather than by any equation that reduces to the inputs. The FEP reference appears only as inspirational framing for the choice of mechanisms; it does not enter the routing equations, the probability calculations, or the reported statistics. No self-citation load-bearing step, no fitted parameter renamed as prediction, and no uniqueness theorem invoked. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: MoE experts become specialized to distinct data distributions in the training regime
Reference graph
Works this paper leans on
- [13] K. Friston, “The free-energy principle: A unified brain theory?” Nature Reviews Neuroscience, vol. 11, no. 2, pp. 127–138, 2010, doi: 10.1038/nrn2787
- [14] J. K. Eshraghian et al., “Training spiking neural networks using lessons from deep learning,” Proceedings of the IEEE, vol. 111, no. 9, pp. 1016–1054, 2023. Available: https://arxiv.org/abs/2109.12894
- [15] Z. Qiu et al., “Layerwise recurrent router for mixture-of-experts,” in Proceedings of ICLR 2025. Available: https://arxiv.org/abs/2408.06793
- [16] Y. Shen and J. Henderson, “Temporally extended mixture-of-experts,” arXiv preprint arXiv:2604.20156, 2026. Available: https://arxiv.org/abs/2604.20156
- [17] A. Liu et al., “DeepSeek-V3 technical report,” arXiv preprint arXiv:2412.19437, 2024. Available: https://arxiv.org/abs/2412.19437
- [18] X. He et al., “ExpertFlow: Efficient mixture-of-experts inference via predictive expert caching and token scheduling,” arXiv preprint arXiv:2410.17954, 2024. Available: https://arxiv.org/abs/2410.17954
- [19] N. Shazeer et al., “Outrageously large neural networks: The sparsely-gated mixture-of-experts layer,” in Proceedings of ICLR 2017. Available: https://arxiv.org/abs/1701.06538
- [20] W. Fedus, B. Zoph, and N. Shazeer, “Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity,” Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022. Available: https://www.jmlr.org/papers/v23/21-0998.html
- [21] A. Q. Jiang et al., “Mixtral of experts,” arXiv preprint arXiv:2401.04088, 2024. Available: https://arxiv.org/abs/2401.04088
- [22] Y. Zhou et al., “Mixture-of-experts with expert choice routing,” in Advances in Neural Information Processing Systems 35, 2022. Available: https://arxiv.org/abs/2202.09368
- [23] B. Zoph et al., “ST-MoE: Designing stable and transferable sparse expert models,” arXiv preprint arXiv:2202.08906, 2022. Available: https://arxiv.org/abs/2202.08906
- [24] A. M. O. Thiombiano et al., “MoxE: Mixture of xLSTM experts with entropy-aware routing,” arXiv preprint arXiv:2505.01459, 2025. Available: https://arxiv.org/abs/2505.01459
- [25] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov, “Transformer-XL: Attentive language models beyond a fixed-length context,” in Proceedings of ACL 2019, pp. 2978–2988. Available: https://arxiv.org/abs/1901.02860
- [26] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” in Proceedings of ICLR 2022. Available: https://arxiv.org/abs/2111.00396
- [27] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023. Available: https://arxiv.org/abs/2312.00752
- [28] B. Peng et al., “RWKV: Reinventing RNNs for the transformer era,” in Findings of EMNLP 2023, pp. 14048–14073. Available: https://arxiv.org/abs/2305.13048
- [29] Y. Sun et al., “Retentive network: A successor to Transformer for large language models,” arXiv preprint arXiv:2307.08621, 2023. Available: https://arxiv.org/abs/2307.08621
- [30] R. P. N. Rao and D. H. Ballard, “Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects,” Nature Neuroscience, vol. 2, no. 1, pp. 79–87, 1999, doi: 10.1038/4580
- [31] J. C. R. Whittington and R. Bogacz, “An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity,” Neural Computation, vol. 29, no. 5, pp. 1229–1262, 2017, doi: 10.1162/NECO_a_00949
- [32] B. Millidge, A. Tschantz, and C. L. Buckley, “Predictive coding approximates backprop along arbitrary computation graphs,” Neural Computation, vol. 34, no. 6, pp. 1329–1368, 2022, doi: 10.1162/neco_a_01497
- [33] X. Li and M. Wicker, “Variational routing: A scalable Bayesian framework for calibrated mixture-of-experts transformers,” arXiv preprint arXiv:2603.09453, 2026. Available: https://arxiv.org/abs/2603.09453
- [34] “Graph mixture of experts and memory-augmented routers for multivariate time series anomaly detection,” arXiv preprint arXiv:2412.19108, 2024. Available: https://arxiv.org/abs/2412.19108
- [35] F. Gloeckle, B. Y. Idrissi, B. Rozière, D. Lopez-Paz, and G. Synnaeve, “Better & faster large language models via multi-token prediction,” in Proceedings of ICML 2024. Available: https://arxiv.org/abs/2404.19737
- [36] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997, doi: 10.1162/neco.1997.9.8.1735
- [37] K. Cho et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” in Proceedings of EMNLP 2014, pp. 1724–1734. Available: https://aclanthology.org/D14-1179/
- [38] K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, and G. Pezzulo, “Active inference: A process theory,” Neural Computation, vol. 29, no. 1, pp. 1–49, 2017, doi: 10.1162/NECO_a_00912
- [39] Z. Fountas, N. Sajid, P. A. M. Mediano, and K. Friston, “Deep active inference agents using Monte-Carlo methods,” in Advances in Neural Information Processing Systems 33, 2020, pp. 11662–11675. Available: https://arxiv.org/abs/2006.04176
- [40] O. Çatal, S. Wauthier, C. De Boom, T. Verbelen, and B. Dhoedt, “Learning generative state space models for active inference,” Frontiers in Computational Neuroscience, vol. 14, p. 574372, 2020, doi: 10.3389/fncom.2020.574372
- [41] L. Pinchetti, T. Salvatori, Y. Yordanov, B. Millidge, Y. Song, and T. Lukasiewicz, “Predictive coding beyond Gaussian distributions,” in Advances in Neural Information Processing Systems 35, 2022. Available: https://arxiv.org/abs/2211.03481
- [42] B. Millidge, T. Salvatori, Y. Song, R. Bogacz, and T. Lukasiewicz, “Predictive coding: Towards a future of deep learning beyond backpropagation?” arXiv preprint arXiv:2202.09467, 2022. Available: https://arxiv.org/abs/2202.09467
- [43] K. Friston et al., “Path integrals, particular kinds and strange things,” Physics of Life Reviews, vol. 47, pp. 35–62, 2023, doi: 10.1016/j.plrev.2023.08.016
- [44] E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks,” IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 51–63, 2019. Available: https://arxiv.org/abs/1901.09948
- [45] G. Bellec et al., “A solution to the learning dilemma for recurrent networks of spiking neurons,” Nature Communications, vol. 11, p. 3625, 2020, doi: 10.1038/s41467-020-17236-y
- [46] Z. Zhou et al., “Spikformer: When spiking neural network meets transformer,” in Proceedings of ICLR 2023. Available: https://arxiv.org/abs/2209.15425
- [47] R.-J. Zhu, Q. Zhao, G. Li, and J. K. Eshraghian, “SpikeGPT: Generative pre-trained language model with spiking neural networks,” arXiv preprint arXiv:2302.13939, 2023. Available: https://arxiv.org/abs/2302.13939