Stochastic Thermodynamics for Autoregressive Generative Models: A Non-Markovian Perspective
Pith reviewed 2026-05-10 17:13 UTC · model grok-4.3
The pith
Autoregressive generative models admit a stochastic-thermodynamic entropy production that can be estimated efficiently from sampled trajectories and that decomposes exactly into non-negative per-step retrospective-inference terms, despite the non-Markovian observed dynamics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop a general theoretical framework based on stochastic thermodynamics for autoregressive generative models and introduce the entropy production, which can be efficiently estimated from sampled trajectories without exponential sampling cost, despite the non-Markovian nature of the observed dynamics. The entropy production decomposes exactly into non-negative per-step contributions in terms of retrospective inference, and each of those terms further splits into information-theoretically meaningful terms: a compression loss and a model mismatch.
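For orientation, the trajectory-level entropy production of stochastic thermodynamics is the log-ratio of forward and time-reversed path probabilities; the schematic below restates the claim in that standard form. The per-step labels are ours, sketching the claimed structure rather than reproducing the paper's equations.

```latex
% Standard trajectory-level form (cf. Seifert 2005 [12]); \Theta is time reversal.
\sigma[y_{1:T}] = \ln \frac{P_{\rightarrow}(y_{1:T})}{P_{\leftarrow}(\Theta y_{1:T})},
\qquad
\langle \sigma \rangle
  = D_{\mathrm{KL}}\!\left(P_{\rightarrow} \,\middle\|\, P_{\leftarrow} \circ \Theta\right) \ge 0 .

% Schematic of the claimed decomposition (our labels, not the paper's equations):
\langle \sigma \rangle = \sum_{t=1}^{T} \sigma_t ,
\qquad
\sigma_t = \underbrace{\sigma_t^{\mathrm{comp}}}_{\text{compression loss}}
         + \underbrace{\sigma_t^{\mathrm{mis}}}_{\text{model mismatch}},
\qquad \sigma_t \ge 0 .
```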
What carries the argument
Entropy production defined for non-Markovian observed processes generated by autoregressive conditional distributions, together with its exact decomposition into retrospective-inference contributions.
If this is right
- Entropy production can be computed tractably from finite sampled trajectories in models like GPT-2 (a minimal estimator sketch follows this list).
- The quantity decomposes into per-step non-negative terms that separate compression loss from model mismatch.
- Token-level entropy production in GPT-2 is dominated by syntactic artifacts, while sentence-level values may distinguish causally ordered from non-causal text.
- In the linear Gaussian case the framework reduces to the Kalman innovation representation with a closed-form expression for entropy production.
- The same decomposition applies uniformly across Transformers, RNNs, state-space models, and Mamba architectures.
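A minimal sketch of what "tractable from finite trajectories" could look like in practice, under an assumption we make explicit: that the estimate reduces to the gap between the model's forward log-likelihood of a token sequence and of its reversal. The paper instead samples trajectories from GPT-2 and evaluates its Eq. (28), including a boundary term p(y_1 | h_0), which this sketch omits.

```python
# Hypothetical estimator sketch: sigma ~ E[log p(y_1:T) - log p(reverse(y_1:T))]
# under GPT-2's own autoregressive conditionals. This is our illustrative
# reading of a forward/backward asymmetry estimate, not the paper's Eq. (28);
# boundary-term handling and block-level coarse-graining are omitted.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sequence_log_prob(ids: torch.Tensor) -> float:
    """Sum of log p(y_t | y_<t) for t >= 2 under the model."""
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]        # (T, vocab)
    log_probs = torch.log_softmax(logits[:-1], dim=-1)     # predicts y_2..y_T
    return log_probs.gather(1, ids[1:].unsqueeze(1)).sum().item()

def entropy_production_estimate(texts: list[str]) -> float:
    """Average forward-minus-reversed log-likelihood per trajectory."""
    gaps = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt").input_ids[0]
        gaps.append(sequence_log_prob(ids) - sequence_log_prob(ids.flip(0)))
    return sum(gaps) / len(gaps)

print(entropy_production_estimate(["The glass slipped from her hand."]))
```

The cost is two forward passes per trajectory, O(NT) in total for N trajectories of length T, which is the substance of the tractability claim: no sum over exponentially many histories appears.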
Where Pith is reading between the lines
- The decomposition supplies a quantitative handle on irreversibility that could be tracked during training to monitor how well a model captures temporal structure.
- Sentence-level entropy production may offer a diagnostic for whether a model has learned causal versus merely statistical patterns in text.
- The framework opens a route to comparing irreversibility across different autoregressive architectures without requiring Markovian approximations.
Load-bearing premise
The entropy production of genuinely non-Markovian observed processes generated by autoregressive models admits an efficient estimator and an exact non-negative decomposition into retrospective-inference terms, without additional assumptions that would fail for high-dimensional models such as Transformers.
What would settle it
A high-dimensional autoregressive model, such as a large Transformer, in which the proposed estimator for entropy production requires exponential sampling cost, or in which the per-step retrospective-inference terms fail to remain non-negative.
Original abstract
Autoregressive generative models -- including Transformers, recurrent neural networks, classical Kalman filters, state space models, and Mamba -- all generate sequences by sampling each output from a deterministic summary of the past, producing genuinely non-Markovian observed processes. We develop a general theoretical framework based on stochastic thermodynamics for this class of architectures and introduce the entropy production, which can be efficiently estimated from sampled trajectories without exponential sampling cost, despite the non-Markovian nature of the observed dynamics. As a proof-of-concept experiment with a large language model (LLM), we evaluate the entropy production for a pre-trained Transformer-based model, GPT-2. We find that the token-level entropy production is dominated by a syntactic artifact, while the sentence-level entropy production may yield a more interpretable signal in comparisons between causally ordered and non-causal text sets. We also demonstrate the framework in the linear Gaussian case, where the model reduces to the Kalman innovation representation and the entropy production admits an analytical expression. We further show that the entropy production decomposes exactly into non-negative per-step contributions in terms of retrospective inference, and each of those terms further splits into information-theoretically meaningful terms: a compression loss and a model mismatch. Our results establish a bridge between stochastic thermodynamics and modern generative models, and provide a starting point for quantifying irreversibility in a broad class of highly non-Markovian processes such as LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a stochastic thermodynamics framework for autoregressive generative models (Transformers, RNNs, state-space models, etc.) that generate non-Markovian observed sequences. It defines an entropy production for these processes that admits an efficient estimator from sampled trajectories, avoiding exponential sampling costs. The entropy production is claimed to decompose exactly into non-negative per-step contributions expressed via retrospective inference; each term further splits into a compression loss and a model mismatch. Analytical results are given for the linear-Gaussian (Kalman) case, and a proof-of-concept evaluation is performed on pre-trained GPT-2, reporting that token-level entropy production is dominated by syntactic artifacts while sentence-level values may distinguish causal from non-causal text.
Significance. If the efficient estimator and exact non-negative decomposition hold without hidden assumptions that fail for high-dimensional trained models, the work supplies a concrete bridge between non-equilibrium thermodynamics and modern sequence models. It would enable quantitative study of irreversibility, information processing, and thermodynamic-like accounting in LLMs and related architectures, with potential downstream uses in model analysis, training diagnostics, and interpretability.
Major comments (3)
- [Theoretical framework and decomposition statements] The central claim that entropy production admits an efficient estimator and an exact non-negative decomposition into retrospective-inference terms (each splitting into compression loss and model mismatch) is load-bearing. The manuscript must supply the explicit derivation showing that the path-probability ratio reduces to quantities computable from the forward conditionals and a single backward pass or summary statistic; without this, the usual exponential cost over histories reappears for genuinely non-Markovian processes.
- [GPT-2 experiment] The evaluation is presented only as a proof-of-concept, with no quantitative metrics, baseline comparisons, statistical controls, or error bars. This leaves open whether the reported dominance of syntactic artifacts at the token level, and the sentence-level distinction between causal and non-causal text, are robust or artifacts of the particular sampling and aggregation choices.
- [Linear-Gaussian / Kalman case] The claim of an analytical expression is important for validation, yet the manuscript must demonstrate that the general decomposition recovers the known Kalman-filter entropy-production formula without additional assumptions; otherwise the reduction serves only as a consistency check rather than independent support.
Minor comments (1)
- [Notation and definitions] Notation for retrospective inference and the two information-theoretic splits should be introduced with explicit equations early in the theoretical section, to avoid ambiguity when the same symbols appear in both the general and linear-Gaussian treatments (an illustrative sketch of such notation follows).
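To make the request concrete, here is one illustrative way such definitions could be stated; the symbols and the two splits below are our guesses at plausible forms, not the paper's actual definitions. Write h_t = Φ_t(y_{1:t}) for the deterministic summary and p_t(y_t | h_{t-1}) for the forward conditional.

```latex
% Hypothetical notation (ours, for illustration only):
% compression loss as the predictive information the summary discards,
% model mismatch as the KL gap between data and model conditionals.
\sigma_t^{\mathrm{comp}} = I(Y_t ; Y_{1:t-1}) - I(Y_t ; H_{t-1}) \;\ge\; 0
\quad \text{(data-processing inequality, since } H_{t-1} = \Phi_{t-1}(Y_{1:t-1})\text{)},

\sigma_t^{\mathrm{mis}} = \mathbb{E}\, D_{\mathrm{KL}}\!\left(
  p^{\mathrm{data}}_t(\cdot \mid Y_{1:t-1}) \,\middle\|\, p_t(\cdot \mid H_{t-1})
\right) \;\ge\; 0 .
```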
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help clarify the presentation of our theoretical framework and experimental results. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.
Point-by-point responses
-
Referee: [Theoretical framework and decomposition statements] The central claim that entropy production admits an efficient estimator and an exact non-negative decomposition into retrospective-inference terms (each splitting into compression loss and model mismatch) is load-bearing. The manuscript must supply the explicit derivation showing that the path-probability ratio reduces to quantities computable from the forward conditionals and a single backward pass or summary statistic; without this, the usual exponential cost over histories reappears for genuinely non-Markovian processes.
Authors: We agree that an explicit, self-contained derivation is essential for the central claims. While the manuscript derives the entropy production and its decomposition from the path-probability ratio using the autoregressive structure, we acknowledge that the reduction to forward conditionals and retrospective inference (via a single backward pass) could be presented more transparently. In the revised manuscript we will expand the theoretical section with a detailed, step-by-step derivation that explicitly shows how the non-Markovian path ratio factors into computable per-step terms without exponential summation over histories. revision: yes
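The shape such a derivation would presumably take (schematic only, under the assumption that the backward path measure is also generated autoregressively from backward summaries, written here with tildes): the path ratio telescopes into per-step log-ratios, each involving only that step's conditionals, so a Monte Carlo average over N sampled trajectories of length T costs O(NT) evaluations rather than a sum over exponentially many histories.

```latex
% Schematic telescoping of the path ratio (tilde quantities are our notation
% for a backward autoregressive parametrization, not the paper's symbols):
\ln \frac{P_{\rightarrow}(y_{1:T})}{P_{\leftarrow}(\Theta y_{1:T})}
= \ln \frac{\prod_{t=1}^{T} p_t(y_t \mid h_{t-1})}
           {\prod_{t=1}^{T} \tilde{p}_t(y_t \mid \tilde{h}_{t+1})}
= \sum_{t=1}^{T} \ln \frac{p_t(y_t \mid h_{t-1})}{\tilde{p}_t(y_t \mid \tilde{h}_{t+1})} .
```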
-
Referee: [GPT-2 experiment] The evaluation is presented only as a proof-of-concept, with no quantitative metrics, baseline comparisons, statistical controls, or error bars. This leaves open whether the reported dominance of syntactic artifacts at the token level, and the sentence-level distinction between causal and non-causal text, are robust or artifacts of the particular sampling and aggregation choices.
Authors: We accept that the GPT-2 section, presented as a proof-of-concept, would benefit from greater rigor. In the revision we will add quantitative metrics (mean entropy-production values with standard errors across multiple trajectories), baseline comparisons (e.g., against shuffled or randomly generated text), and statistical controls (significance tests for the reported distinctions). We will also document the exact sampling and aggregation procedures to allow reproducibility and assessment of robustness. revision: yes
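A minimal sketch of the promised statistical treatment, assuming per-trajectory entropy-production values are already in hand (the arrays below are synthetic placeholders, not the paper's results): means with standard errors, plus a two-sided permutation test against a shuffled-text baseline.

```python
# Illustrative error bars and significance test for per-trajectory sigma
# values; the data here are synthetic placeholders, not the paper's results.
import numpy as np

def mean_and_se(values: np.ndarray) -> tuple[float, float]:
    """Sample mean and standard error across trajectories."""
    return values.mean(), values.std(ddof=1) / np.sqrt(len(values))

def permutation_test(a: np.ndarray, b: np.ndarray, n_perm: int = 10_000) -> float:
    """Two-sided permutation p-value for a difference in means."""
    rng = np.random.default_rng(0)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        hits += abs(perm[:len(a)].mean() - perm[len(a):].mean()) >= observed
    return (hits + 1) / (n_perm + 1)

rng = np.random.default_rng(1)
causal = rng.normal(0.5, 0.2, size=30)     # placeholder: 30 causal-text sigmas
shuffled = rng.normal(0.0, 0.2, size=30)   # placeholder: shuffled baseline
m, se = mean_and_se(causal)
print(f"sigma = {m:.3f} +/- {se:.3f}, p = {permutation_test(causal, shuffled):.4f}")
```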
-
Referee: [Linear-Gaussian / Kalman case] The claim of an analytical expression is important for validation, yet the manuscript must demonstrate that the general decomposition recovers the known Kalman-filter entropy-production formula without additional assumptions; otherwise the reduction serves only as a consistency check rather than independent support.
Authors: We agree that an explicit recovery of the known Kalman-filter result strengthens the validation. In the revised manuscript we will include a dedicated subsection that applies the general decomposition directly to the linear-Gaussian (Kalman) case and verifies that it reproduces the standard entropy-production formula without extra assumptions, thereby confirming consistency with established results. revision: yes
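A numerical consistency check in this spirit (our construction, not the paper's code): run a scalar Kalman filter to obtain the innovation-form predictive likelihood, then estimate entropy production as the average forward-minus-reversed log-likelihood. A stationary scalar Gaussian process is time-reversible [78], so the scalar estimate should vanish within Monte Carlo error; a multivariate model with non-symmetric dynamics would be the place to test the claimed closed form.

```python
# Sanity check (our construction): for a scalar linear-Gaussian state-space
# model the observed process is stationary Gaussian, hence time-reversible,
# so a forward-minus-reversed log-likelihood estimate of sigma should be ~0.
import numpy as np

A, C, Q, R = 0.9, 1.0, 0.1, 0.2        # dynamics, observation, noise variances
P0 = Q / (1 - A**2)                     # stationary state variance
rng = np.random.default_rng(0)

def simulate(T: int) -> np.ndarray:
    """Sample an observed trajectory started in the stationary state."""
    x, ys = rng.normal(scale=np.sqrt(P0)), np.empty(T)
    for t in range(T):
        ys[t] = C * x + rng.normal(scale=np.sqrt(R))
        x = A * x + rng.normal(scale=np.sqrt(Q))
    return ys

def log_likelihood(ys: np.ndarray) -> float:
    """Innovation-form (Kalman predictive) log-likelihood of a trajectory."""
    m, P, ll = 0.0, P0, 0.0
    for y in ys:
        S = C * P * C + R               # innovation variance
        ll += -0.5 * (np.log(2 * np.pi * S) + (y - C * m) ** 2 / S)
        K = P * C / S                   # gain, then measurement update
        m, P = m + K * (y - C * m), (1 - K * C) * P
        m, P = A * m, A * P * A + Q     # time update
    return ll

gaps = []
for _ in range(2000):
    ys = simulate(120)
    gaps.append(log_likelihood(ys) - log_likelihood(ys[::-1]))
se = np.std(gaps, ddof=1) / np.sqrt(len(gaps))
print(f"sigma estimate: {np.mean(gaps):.4f} +/- {se:.4f} (expect ~0 here)")
```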
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper introduces entropy production for non-Markovian processes generated by autoregressive models and claims an efficient estimator from trajectories plus an exact non-negative decomposition into retrospective-inference terms that further split into compression loss and model mismatch. No equations or sections are provided that reduce the claimed decomposition or estimator to a tautological redefinition of the input quantities (such as the forward conditionals themselves) or to a self-citation chain whose load-bearing step is unverified. The linear-Gaussian analytical case and GPT-2 experiment are presented as independent verifications rather than forced by construction. The framework is therefore self-contained against external benchmarks from stochastic thermodynamics.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: standard assumptions of stochastic thermodynamics for defining entropy production in driven non-equilibrium systems.
Reference graph
Works this paper leans on
- [1] Anchor: "The glass slipped from her hand. It fell to the floor. It broke into many pieces. She swept them up carefully." Context: This dominance of the syntactic artifact in σ_token motivates the block-level coarse-graining, which has no counterpart in previous studies of forward–backward asymmetry in LLMs [54, 55]. In fact, the block-level values shown in Figure 3(b) are much smaller, which is consistent with the discussion in Section IV D. Note that the reference distributi...
- [2] Anchor: "Markovian." Context: General definition. In this paper, we say that a process x_t provides a Markovian embedding of the observed non-Markovian process y_t if x_t is Markovian and the joint law factorizes as P_→(x_{1:T}, y_{1:T}) = p(x_1) ∏_{t=1}^{T−1} p_t(x_{t+1} | x_t) ∏_{t=1}^{T} q_t(y_t | x_t) (A1). That is, y_t is emitted memorylessly from the Markovian state x_t at each time step. This definition ca...
- [3] Anchor: "First, remember that in the general setting including Transformers, h_t = Φ_t(y_1, ..." Context: Relation to the present framework. We examine how the autoregressive framework of the main text relates to the Markovian embedding defined above. First, remember that in the general setting including Transformers, h_t = Φ_t(y_1, ..., y_t) does not factor through a two-argument recursion, and thus even (h_t, y_t) is not Markovian in general. In the recursive...
- [4] Anchor: "If the joint process (x_t, y_t) satisfies the factorization (A1), then x_t constitutes a Markovian embedding of y_t." Context: True environmental state as a possible Markovian embedding. Behind the observations y_t there may exist a true environmental state whose dynamics generates y_t. If the joint process (x_t, y_t) satisfies the factorization (A1) with this state as x_t, then x_t constitutes a Markovian embedding of y_t. The Kalman filter example (Section VI) is a concrete instance: x_t e...
- [5] Anchor: "Eq. (28) contains the boundary term p(y_1 | h_0)." Context: Details of sampling from GPT-2. For the sampling experiment from GPT-2 itself (Section V A), the path probability in Eq. (28) contains the boundary term p(y_1 | h_0). This term must be specified separately, because the tokenizer used in our implementation does not prepend an initial beginning-of-sequence (BOS) token automatically. In the HuggingFace GPT-...
- [6] Anchor: "To examine how the Monte Carlo estimates stabilize as the sample size grows, Figure 6 plots the cumulative sample mean of σ_token/T and σ_block/T′ for T = 120, as a function of N." Context: Convergence of entropy production and fluctuation theorem. We next show supplemental numerical results for the Monte Carlo sampling from GPT-2. The shaded bands indicate 95% confide...
- [7] Anchor: "The glass slipped from her hand. It fell to the floor. It broke into many pieces. She swept them up carefully." Context: Input text sets and supplemental results. The 60 English-language texts used for Figure 4 in Section V B (30 causal and 30 non-causal) were generated by inputting a fixed prompt into a new chat session of Claude Opus 4.6 (Anthropic) and using the output without manual revision or selection. The prompt specifies the desired structure (four short sentenc...
- [8] U. Seifert, "Stochastic thermodynamics, fluctuation theorems and molecular machines," Rep. Prog. Phys. 75, 126001 (2012).
- [9] L. Peliti and S. Pigolotti, Stochastic Thermodynamics: An Introduction (Princeton University Press, Princeton, 2021).
- [10] C. Jarzynski, "Nonequilibrium equality for free energy differences," Phys. Rev. Lett. 78, 2690–2693 (1997).
- [11] G. E. Crooks, "Entropy production fluctuation theorem and the nonequilibrium work relation for free energy differences," Phys. Rev. E 60, 2721–2726 (1999).
- [12] U. Seifert, "Entropy production along a stochastic trajectory and an integral fluctuation theorem," Phys. Rev. Lett. 95, 040602 (2005).
- [13] A. Gómez-Marín, J. M. R. Parrondo, and C. Van den Broeck, "Lower bounds on dissipation upon coarse graining," Phys. Rev. E 78, 011107 (2008).
- [14] É. Roldán and J. M. R. Parrondo, "Estimating dissipation from single stationary trajectories," Phys. Rev. Lett. 105, 150607 (2010).
- [15] É. Roldán and J. M. R. Parrondo, "Entropy production and Kullback–Leibler divergence between stationary trajectories of discrete systems," Phys. Rev. E 85, 031129 (2012).
- [16] K. Kawaguchi and Y. Nakayama, "Fluctuation theorem for hidden entropy production," Phys. Rev. E 88, 022147 (2013).
- [17] M. Kahlen and J. Ehrich, "Hidden slow degrees of freedom and fluctuation theorems: an analytically solvable model," J. Stat. Mech.: Theory Exp. 2018, 063204 (2018).
- [18] G. E. Crooks and S. E. Still, "Marginal and conditional second laws of thermodynamics," EPL 125, 40005 (2019).
- [19] D. S. Seara, B. B. Machta, and M. P. Murrell, "Irreversibility in dynamical phases and transitions," Nature Commun. 12, 392 (2021).
- [20] T. Speck and U. Seifert, "The Jarzynski relation, fluctuation theorems, and stochastic thermodynamics for non-Markovian processes," J. Stat. Mech.: Theory Exp. 2007, L09002 (2007).
- [21] T. Ohkuma and T. Ohta, "Fluctuation theorems for nonlinear generalized Langevin systems," J. Stat. Mech.: Theory Exp. 2007, P10010 (2007).
- [22] S. Rahav and C. Jarzynski, "Fluctuation relations and coarse-graining," J. Stat. Mech. P09012 (2007).
- [23] A. Puglisi, S. Pigolotti, L. Rondoni, and A. Vulpiani, "Entropy production and coarse graining in Markov processes," J. Stat. Mech. P05015 (2010).
- [24] M. Esposito, "Stochastic thermodynamics under coarse graining," Phys. Rev. E 85, 041125 (2012).
- [25] T. Munakata and M. L. Rosinberg, "Entropy production and fluctuation theorems for Langevin processes under continuous non-Markovian feedback control," Phys. Rev. Lett. 112, 180601 (2014).
- [26] M. L. Rosinberg, T. Munakata, and G. Tarjus, "Stochastic thermodynamics of Langevin systems under time-delayed feedback control: Second-law-like inequalities," Phys. Rev. E 91, 042114 (2015).
- [27] M. Polettini and M. Esposito, "Effective thermodynamics for a marginal observer," Phys. Rev. Lett. 119, 240601 (2017).
- [28] J. van der Meer, J. Degünther, and U. Seifert, "Time-resolved statistics of snippets as general framework for model-free entropy estimators," Phys. Rev. Lett. 130, 257101 (2023).
- [29] J. Degünther, J. van der Meer, and U. Seifert, "Fluctuating entropy production on the coarse-grained level: inference and localization of irreversibility," Phys. Rev. Research 6, 023175 (2024).
- [30] K. Kanazawa and A. Dechant, "Stochastic thermodynamics for classical non-Markov jump processes," arXiv:2506.04726 (2025).
- [31] J. M. R. Parrondo, J. M. Horowitz, and T. Sagawa, "Thermodynamics of information," Nature Phys. 11, 131–139 (2015).
- [32] T. Sagawa and M. Ueda, "Generalized Jarzynski equality under nonequilibrium feedback control," Phys. Rev. Lett. 104, 090602 (2010).
- [33] T. Sagawa and M. Ueda, "Fluctuation theorem with information exchange: Role of correlations in stochastic thermodynamics," Phys. Rev. Lett. 109, 180602 (2012).
- [34] T. Sagawa and M. Ueda, "Nonequilibrium thermodynamics of feedback control," Phys. Rev. E 85, 021104 (2012).
- [35] S. Still, D. A. Sivak, A. J. Bell, and G. E. Crooks, "Thermodynamics of prediction," Phys. Rev. Lett. 109, 120604 (2012).
- [36] S. Ito and T. Sagawa, "Information thermodynamics on causal networks," Phys. Rev. Lett. 111, 180603 (2013).
- [37] J. M. Horowitz and M. Esposito, "Thermodynamics with continuous information flow," Phys. Rev. X 4, 031015 (2014).
- [38] S. Ito, "Backward transfer entropy: Informational measure for detecting hidden Markov models and its interpretations in thermodynamics, gambling and causality," Sci. Rep. 6, 36831 (2016).
- [39] D. H. Wolpert and A. Kolchinsky, "Thermodynamics of computing with circuits," New J. Phys. 22, 063047 (2020).
- [40] G. Manzano, D. Subero, O. Maillet, R. Fazio, J. P. Pekola, and É. Roldán, "Thermodynamics of Gambling Demons," Phys. Rev. Lett. 126, 080603 (2021).
- [41] G. Manzano, G. Kardeş, É. Roldán, and D. H. Wolpert, "Thermodynamics of computations with absolute irreversibility, unidirectional transitions, and stochastic computation times," Phys. Rev. X 14, 021026 (2024).
- [42] D. H. Wolpert et al., "Is stochastic thermodynamics the key to understanding the energy costs of computation?" Proc. Natl. Acad. Sci. U.S.A. 121, e2321112121 (2024).
- [43] S. Goldt and U. Seifert, "Stochastic thermodynamics of learning," Phys. Rev. Lett. 118, 010601 (2017).
- [44] S. Rooke, D. Krotov, V. Balasubramanian, and D. H. Wolpert, "Stochastic thermodynamics of associative memory," arXiv:2601.01253 (2026).
- [45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30 (NeurIPS, 2017), pp. 5998–6008.
- [46] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Technical Report (2019).
- [47] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems 33 (NeurIPS, 2020), pp. 1877–1901.
- [48] H. Ramsauer et al., "Hopfield networks is all you need," in The Ninth International Conference on Learning Representations (ICLR, 2021).
- [49] J. L. Elman, "Finding structure in time," Cogn. Sci. 14, 179–211 (1990).
- [50] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput. 9, 1735–1780 (1997).
- [51] R. E. Kalman, "A new approach to linear filtering and prediction problems," J. Basic Eng. 82, 35–45 (1960).
- [52] B. D. O. Anderson and J. B. Moore, Optimal Filtering (Prentice-Hall, Englewood Cliffs, NJ, 1979).
- [53] A. Lindquist and G. Picci, Linear Stochastic Systems: A Geometric Approach to Modeling, Estimation and Identification (Springer, Berlin, 2015).
- [54] A. Gu, K. Goel, and C. Ré, "Efficiently modeling long sequences with structured state spaces," in International Conference on Learning Representations (ICLR, 2022).
- [55] J. T. H. Smith, A. Warrington, and S. W. Linderman, "Simplified state space layers for sequence modeling," in International Conference on Learning Representations (ICLR, 2023).
- [56] A. Gu and T. Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," in Conference on Language Modeling (COLM, 2024).
- [57] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in International Conference on Learning Representations (ICLR, 2014).
- [58] M. D. Hoffman and M. J. Johnson, "ELBO surgery: yet another way to carve up the variational evidence lower bound," in Proceedings of the Workshop in Advances in Approximate Bayesian Inference, NIPS, Vol. 1 (2016).
- [59] A. A. Alemi et al., "Fixing a broken ELBO," in Proceedings of the 35th International Conference on Machine Learning (ICML, 2018), pp. 159–168.
- [60] R. Kawai, J. M. R. Parrondo, and C. Van den Broeck, "Dissipation: The phase-space perspective," Phys. Rev. Lett. 98, 080602 (2007).
- [61] V. Papadopoulos, J. Wenger, and C. Hongler, "Arrows of time for large language models," in Proceedings of the 41st International Conference on Machine Learning (ICML, 2024), pp. 39509–39528.
- [62] S. Yu et al., "Reverse modeling in large language models," in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers) (2025), pp. 306–320.
- [63] C. Moslonka, H. Randrianarivo, A. Garnier, and E. Malherbe, "Learned hallucination detection in black-box LLMs using token-level entropy production rate," in Advances in Information Retrieval (ECIR 2026), Lecture Notes in Computer Science 16483, pp. 115–130 (Springer, Cham, 2026).
- [64] R. G. James, N. Barnett, and J. P. Crutchfield, "Information flows? A critique of transfer entropies," Phys. Rev. Lett. 116, 238701 (2016).
- [65] M. Miliani et al., "ExpliCa: Evaluating explicit causal reasoning in large language models," in Findings of the Association for Computational Linguistics: ACL 2025 (2025), pp. 17335–17355.
- [66] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. (Wiley-Interscience, Hoboken, NJ, 2006).
- [67] S. K. Mitter and N. J. Newton, "Information and entropy flow in the Kalman–Bucy filter," J. Stat. Phys. 118, 145–176 (2005).
- [68] N. J. Newton, "Dual Kalman–Bucy filters and interactive entropy production," SIAM J. Control Optim. 45, 998–1016 (2006).
- [69] N. J. Newton, "Dual nonlinear filters and entropy production," SIAM J. Control Optim. 46, 1637–1663 (2007).
- [70] N. J. Newton, "Interactive statistical mechanics and nonlinear filtering," J. Stat. Phys. 133, 711–737 (2008).
- [71] J. M. Horowitz and H. Sandberg, "Second-law-like inequalities with information and their interpretations," New J. Phys. 16, 125007 (2014).
- [72] H. Sandberg, J.-C. Delvenne, N. J. Newton, and S. K. Mitter, "Maximum work extraction and implementation costs for nonequilibrium Maxwell's demons," Phys. Rev. E 90, 042119 (2014).
- [73] T. Matsumoto and T. Sagawa, "Role of sufficient statistics in stochastic thermodynamics and its implication to sensory adaptation," Phys. Rev. E 97, 042103 (2018).
- [74] K. Kumasaki, K. Tojo, T. Sagawa, and K. Funo, "Thermodynamic uncertainty relation for feedback cooling," Phys. Rev. E 113, 024134 (2026).
- [75] C. Godrèche and J. M. Luck, "Characterising the nonequilibrium stationary states of Ornstein–Uhlenbeck processes," J. Phys. A: Math. Theor. 52, 035002 (2019).
- [76] G. T. Landi, T. Tomé, and M. J. de Oliveira, "Entropy production in linear Langevin systems," J. Phys. A: Math. Theor. 46, 395001 (2013).
- [77] M. Gilson, E. Tagliazucchi, and R. Cofré, "Entropy production of multivariate Ornstein–Uhlenbeck processes correlates with consciousness levels in the human brain," Phys. Rev. E 107, 024121 (2023).
- [78] G. Weiss, "Time-reversibility of linear stochastic processes," J. Appl. Probab. 12, 831–836 (1975).
- [79] H. Tong and Z. Zhang, "On time-reversibility of multivariate linear processes," Statist. Sinica 15, 495–504 (2005).
- [80] T. T. Georgiou and A. Lindquist, "On time-reversibility of linear stochastic models," in Proceedings of the 19th IFAC World Congress (IFAC, Cape Town, 2014), pp. 10403–10408; arXiv:1309.0165 (2013).