pith. machine review for the scientific record.

arxiv: 2605.07844 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 2 Lean theorem links

Distributional simplicity bias and effective convexity in Energy Based Models

Alfonso de Jesús Navas Gómez, Aurélien Decelle, Beatriz Seoane

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:20 UTC · model grok-4.3

classification 💻 cs.LG
keywords energy-based models · gradient flow · fixed points · distributional simplicity bias · Ising models · Fourier expansion · binary distributions · learning dynamics

The pith

Gradient flow in energy-based models on binary variables admits data-consistent fixed points and learns lower-order interactions before higher-order ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the training dynamics of energy-based models as a dynamical system on the space of distributions over binary variables. It establishes that when the model is expressive enough, gradient descent admits fixed points that either exactly match the target distribution or are spurious stationary states that fail to match it. Perturbation directions around the matching fixed points are either stable or neutral, with neutral directions leaving the effective model unchanged. The dynamics further impose a strict order on what gets learned: low-order patterns are captured first, higher-order ones only later.

Core claim

Under sufficient expressivity, the gradient flow induced by learning strictly positive distributions over binary variables admits two types of fixed points: data-consistent points, which exactly reproduce the target distribution, and spurious points, which satisfy stationarity without matching the target distribution. Around data-consistent points, perturbations are either stable or neutral, with neutral directions leaving the effective model invariant. Gradient dynamics induce a hierarchy in which lower-order interactions are learned before higher-order ones.

What carries the argument

The effective model, viewed either as a generalised Ising model with higher-order interactions or as the Fourier expansion of the energy function; it encodes the state of the gradient flow and exposes the interaction-order hierarchy.
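For concreteness, the mapping in question is the standard parity (Walsh-Fourier) expansion of a pseudo-Boolean function, in the sense of O'Donnell (reference 32 below); the display is a generic rendering of that expansion, not an equation copied from the paper:

```latex
% Every energy E on s in {-1,+1}^n expands uniquely in the parity basis:
\[
  E(s) \;=\; \sum_{S \subseteq [n]} \hat{E}(S)\,\chi_S(s),
  \qquad
  \chi_S(s) \;=\; \prod_{i \in S} s_i,
  \qquad
  \hat{E}(S) \;=\; \frac{1}{2^n}\sum_{s \in \{-1,+1\}^n} E(s)\,\chi_S(s).
\]
```

Identifying -Ê(S) with a coupling J_S on the spin subset S gives the generalised Ising form, and |S| is the interaction order along which the claimed hierarchy is indexed.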

If this is right

  • Data-consistent fixed points act as stable or neutrally stable attractors under small perturbations.
  • Spurious fixed points satisfy stationarity yet are avoided in practice because the dynamics reach data consistency at low orders first.
  • The learning process exhibits a strict ordering of interactions from low to high degree.
  • Neutral perturbation directions around data-consistent points leave the effective model unchanged, preserving the learned distribution.
  • This ordering supplies a direct mechanistic reason why fixed points that are not data-consistent at low orders are not observed in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same ordering mechanism could be tested by monitoring the spectrum of interaction coefficients during training on synthetic Ising-like data with known interaction orders (a sketch follows this list).
  • If the effective-model representation extends to continuous or high-dimensional settings, similar simplicity biases might appear in other gradient-trained generative models.
  • Practitioners might accelerate convergence by warm-starting with models restricted to low-order terms.
  • The neutral directions suggest that certain parameter redundancies could be exploited for more efficient parameterisation without changing the learned distribution.
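A minimal sketch of the monitoring experiment suggested in the first bullet, not code from the paper: train a fully parameterised EBM on n = 5 spins with exact gradients, where the target is a random pairwise (order-2) Ising model, and track the norm of the coefficients at each interaction order. All sizes and the learning rate are illustrative choices.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 5
states = np.array(list(itertools.product([-1, 1], repeat=n)))  # all 2^n spin configurations

# Parity features chi_S(s) for every non-empty subset S, grouped by order |S|.
subsets = [S for k in range(1, n + 1) for S in itertools.combinations(range(n), k)]
orders = np.array([len(S) for S in subsets])
chi = np.array([states[:, S].prod(axis=1) for S in subsets]).T  # (2^n, #subsets)

def boltzmann(J):
    """p(s) proportional to exp(sum_S J_S chi_S(s)), i.e. energy E = -sum_S J_S chi_S."""
    logits = chi @ J
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Target has couplings only at orders 1 and 2.
J_target = np.where(orders <= 2, 0.5 * rng.standard_normal(len(subsets)), 0.0)
data_moments = boltzmann(J_target) @ chi  # <chi_S> under the data distribution

# Exact gradient ascent on the log-likelihood: dJ_S = <chi_S>_data - <chi_S>_model.
J, lr, traj = np.zeros(len(subsets)), 0.5, []
for t in range(400):
    J += lr * (data_moments - boltzmann(J) @ chi)
    traj.append(J.copy())
    if t % 100 == 0:
        print(t, [round(np.linalg.norm(J[orders == k]), 3) for k in range(1, n + 1)])
traj = np.array(traj)  # (steps, #subsets), reused by the probe further below
```

Under the paper's claim, the printed order-1 and order-2 norms should move first while orders 3 to 5 stay near zero; any early growth at high order would already be informative.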

Load-bearing premise

The model class must be expressive enough to follow the full gradient flow without hidden approximations, and targets must be strictly positive distributions over binary variables.
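Why strict positivity carries weight (a one-line observation, not quoted from the paper): the target energy must be finite everywhere for a data-consistent point to exist inside the model class.

```latex
\[
  p(s) > 0 \;\;\forall\, s \in \{-1,+1\}^n
  \quad\Longrightarrow\quad
  E^{\star}(s) \;=\; -\log p(s) \;\text{ is finite for all } s,
\]
```

so E* itself admits the Fourier expansion above, and exact matching is attainable rather than only approachable in a limit.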

What would settle it

Training an energy-based model on a binary target distribution and finding that a higher-order interaction term reaches its target value before a lower-order term does would falsify the claimed learning hierarchy.
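One way to operationalise this test, building on the trajectory `traj` recorded in the training sketch above (the tolerance and the exact notion of "reaches its target" are choices of this sketch, not the paper's):

```python
# First-hit probe: step at which each coefficient first comes within tol of
# its target, then check whether any higher-order coefficient converges
# before some lower-order one. Coefficients whose target is zero would hit
# trivially at step 0, so they are masked out.
def first_hit_steps(traj, J_target, tol=1e-2):
    hit = np.abs(traj - J_target) < tol            # (steps, #coefficients)
    return np.where(hit.any(axis=0), hit.argmax(axis=0), np.inf)

def hierarchy_violated(traj, J_target, orders, tol=1e-2):
    steps = first_hit_steps(traj, J_target, tol)
    active = np.abs(J_target) > tol                # ignore trivially-hit zeros
    for k in range(2, int(orders.max()) + 1):
        hi = steps[(orders == k) & active]
        lo = steps[(orders == k - 1) & active]
        if hi.size and lo.size and hi.min() < lo.max():
            return True                            # some high order beat a low one
    return False

print("hierarchy violated:", hierarchy_violated(traj, J_target, orders))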

Figures

Figures reproduced from arXiv: 2605.07844 by Alfonso de Jesús Navas Gómez, Aurélien Decelle, Beatriz Seoane.

Figure 1. Frobenius norm of the effective parameters, extracted from three distinct RBM trainings using the mapping of … (image not reproduced; see source)

Figure 2. (Left) Covariance matrix of the data generated with the model in Eq. … (image not reproduced; see source)
Original abstract

Energy-based learning is a powerful framework for generative modelling, but its training is inherently non-convex, leading potentially to sensitivity to initialisation, poor local optima, and unstable gradient dynamics. We present a dynamical analysis of energy-based learning through the lens of the effective model, which can be interpreted as either a generalised Ising model with higher-order interactions or the Fourier expansion of the energy. Under sufficient expressivity, we show that the gradient flow induced by learning strictly positive distributions over binary variables admits two types of fixed points: data-consistent points, which exactly reproduce the target distribution, and spurious points, which satisfy stationarity without matching the target distribution. Around data-consistent points, we show that perturbations are either stable or neutral, with neutral directions leaving the effective model invariant. Finally, we show that gradient dynamics induce a hierarchy in which lower-order interactions are learned before higher-order ones. This provides a mechanistic explanation for the distributional simplicity bias and clarifies why fixed points that are not data-consistent at low orders are not observed in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that energy-based learning for strictly positive binary distributions can be analyzed via an 'effective model' (generalized Ising model or Fourier expansion of the energy). Under sufficient expressivity, gradient flow on this representation admits two classes of fixed points—data-consistent points that exactly match the target distribution and spurious points that satisfy stationarity without matching it—along with stability/neutrality properties around the data-consistent points and a dynamical hierarchy in which lower-order interactions are learned before higher-order ones. This is presented as a mechanistic account of the distributional simplicity bias in EBMs.

Significance. If the central claims hold, the work supplies a concrete dynamical explanation for why EBM training exhibits simplicity bias and why certain non-data-consistent fixed points are not observed in practice. The effective-model reduction is a useful interpretive device that could inform initialization, regularization, or architecture choices. The paper receives credit for attempting a parameter-free derivation of the hierarchy and fixed-point classification directly from the gradient flow, though the strength of these contributions depends on verifying the reduction from neural-network parameters.

major comments (2)
  1. [analysis of gradient flow and fixed points in the effective model] The fixed-point classification and stability analysis (data-consistent vs. spurious points, neutral directions) are derived by treating the effective Fourier coefficients as the direct variables of gradient flow. When the energy is realized by a neural network, the actual parameter update composes this flow with the Jacobian of the map from network weights to Fourier coefficients; because this Jacobian is neither constant nor guaranteed full-rank, the induced dynamics on the effective coefficients can differ, potentially creating or destroying the claimed spurious fixed points and altering the timescale separation.
  2. [derivation of the learning hierarchy] The hierarchy result (lower-order interactions learned before higher-order ones) is obtained from the eigenvalue structure or timescale separation in the effective-model flow. The same Jacobian issue applies: even under the sufficient-expressivity assumption, the composition with a non-constant Jacobian can change the ordering of learning timescales or introduce additional neutral directions not visible in the direct effective-model analysis.
minor comments (2)
  1. [assumptions and setup] The precise definition of 'sufficient expressivity' and the conditions under which the effective model faithfully captures the NN-induced flow should be stated as an explicit assumption with a supporting lemma or reference (a numerical probe of the weights-to-coefficients Jacobian is sketched after these comments).
  2. [effective model definition] Notation for the Fourier coefficients and the effective energy could be made more explicit (e.g., an equation showing the exact mapping from the energy function E to its Fourier representation) to aid readability.
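To make the Jacobian concern of major comment 1 concrete, here is a numerical probe under illustrative assumptions (a tiny RBM, finite-difference derivatives; none of the specific choices come from the paper): map the RBM weights to the Fourier coefficients of the marginal energy over the visible spins and check the rank of the Jacobian of that map.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
nv, nh = 4, 3
V = np.array(list(itertools.product([-1, 1], repeat=nv)))      # (16, nv) visible states
subsets = [S for k in range(1, nv + 1) for S in itertools.combinations(range(nv), k)]
chi = np.array([V[:, S].prod(axis=1) for S in subsets]).T      # (16, 15) parity features

def fourier_coeffs(theta):
    """theta = flattened (W, a, b); returns Fourier coeffs of the marginal energy."""
    W = theta[: nv * nh].reshape(nv, nh)
    a, b = theta[nv * nh : nv * nh + nv], theta[nv * nh + nv :]
    # Marginal free energy of the RBM over visible spins (up to an additive constant):
    E_eff = -V @ a - np.log1p(np.exp(V @ W + b)).sum(axis=1)
    return chi.T @ E_eff / len(V)   # hat{E}(S) = 2^{-nv} sum_v E(v) chi_S(v)

theta0 = 0.1 * rng.standard_normal(nv * nh + nv + nh)
eps = 1e-6
# Central finite differences, one parameter direction at a time.
J = np.array([
    (fourier_coeffs(theta0 + eps * e) - fourier_coeffs(theta0 - eps * e)) / (2 * eps)
    for e in np.eye(len(theta0))
]).T                                 # (15 coefficients, 19 parameters)
print("Jacobian shape:", J.shape, "rank:", np.linalg.matrix_rank(J, tol=1e-6))
```

Full rank at a point means the parameter dynamics can realise any infinitesimal motion of the effective coefficients there; rank deficiency is exactly the situation the referee warns about.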

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful reading and insightful comments on the connection between the effective-model analysis and neural-network realizations. We address the two major comments point by point below, providing a substantive response while noting the revisions we will make.

Point-by-point responses
  1. Referee: [analysis of gradient flow and fixed points in the effective model] The fixed-point classification and stability analysis (data-consistent vs. spurious points, neutral directions) are derived by treating the effective Fourier coefficients as the direct variables of gradient flow. When the energy is realized by a neural network, the actual parameter update composes this flow with the Jacobian of the map from network weights to Fourier coefficients; because this Jacobian is neither constant nor guaranteed full-rank, the induced dynamics on the effective coefficients can differ, potentially creating or destroying the claimed spurious fixed points and altering the timescale separation.

    Authors: We agree that the parameterization introduces a Jacobian, but the fixed-point correspondence is preserved in one direction. Let phi(w) denote the map from network parameters w to effective Fourier coefficients. The loss depends on the distribution, hence on phi(w), so the parameter gradient is J^T times the effective gradient, where J = d phi / dw (the composed dynamics are written out after these responses). Consequently, every point where the effective gradient vanishes remains a fixed point in parameter space; both the data-consistent and spurious fixed points of the effective model are therefore fixed points of the actual dynamics. The Jacobian can, however, introduce additional fixed points lying in the kernel of J^T and can modify local stability and neutral directions. Under the paper's sufficient-expressivity assumption we expect J to be locally full rank around the fixed points of interest, which would prevent extra fixed points and preserve the stability classification. We will revise the manuscript to state this explicitly, add a short derivation of the composed gradient, and note the conditions under which the effective-model results carry over exactly. This is a partial revision; a complete spectral analysis for arbitrary architectures lies beyond the present scope. [revision: partial]

  2. Referee: [derivation of the learning hierarchy] The hierarchy result (lower-order interactions learned before higher-order ones) is obtained from the eigenvalue structure or timescale separation in the effective-model flow. The same Jacobian issue applies: even under the sufficient-expressivity assumption, the composition with a non-constant Jacobian can change the ordering of learning timescales or introduce additional neutral directions not visible in the direct effective-model analysis.

    Authors: The induced flow on the effective coefficients is a Riemannian gradient flow whose metric is given by J J^T. Fixed points remain the same, but the local linearization and therefore the convergence rates are altered by this metric. If J J^T is approximately diagonal (or block-diagonal) with respect to interaction order, the timescale separation and the lower-to-higher learning order are preserved; otherwise the ordering could be modified. The sufficient-expressivity assumption does not automatically guarantee this diagonal structure, so the referee's concern is valid. We will add a paragraph explaining the composed dynamics, stating the additional assumption needed for the hierarchy to be invariant, and observing that the empirical simplicity bias reported in the literature is consistent with the effective-model prediction. This will be a partial revision. [revision: partial]
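Written out, the composed dynamics the two responses describe are elementary chain-rule algebra in the authors' notation (phi for the effective coefficients, L for the loss); this is a rendering of their argument, not a new result:

```latex
\[
  \dot{w} \;=\; -\,\nabla_{w}\mathcal{L}
          \;=\; -\,J^{\top}\nabla_{\phi}\mathcal{L},
  \qquad
  \dot{\phi} \;=\; J\,\dot{w}
             \;=\; -\,J J^{\top}\,\nabla_{\phi}\mathcal{L},
  \qquad
  J \;=\; \frac{\partial \phi}{\partial w}.
\]
```

So every zero of the effective gradient is a fixed point in parameter space, new fixed points require the effective gradient to lie in ker(J^T), and the induced flow on phi is a gradient flow in the metric set by J J^T; the hierarchy survives exactly when that metric is (block-)diagonal with respect to interaction order.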

standing simulated objections not resolved
  • A general proof that the Jacobian J J^T preserves the exact ordering of learning timescales for arbitrary neural-network architectures realizing the energy function.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper derives fixed-point classes and the lower-to-higher order learning hierarchy directly from the gradient-flow equations written in the effective Fourier / generalized Ising coordinates. These coordinates are introduced as an exact reparametrization of the energy (under the maintained assumption of sufficient expressivity and strictly positive target distributions), not as a fitted model whose parameters are later renamed as predictions. No load-bearing step reduces to a self-citation chain, an ansatz smuggled via prior work, or a definitional identity; the stability/neutrality statements and timescale separation follow from the algebraic structure of the flow itself. The Jacobian concern raised by the skeptic is a question of approximation validity, not a circularity in the presented chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed from abstract alone; no explicit free parameters, axioms, or invented entities can be extracted. The effective model is presented as an interpretive device rather than a newly postulated entity with independent evidence.

pith-pipeline@v0.9.0 · 5484 in / 1307 out tokens · 50669 ms · 2026-05-11T02:20:42.549354+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    and Louis, Ard A

    Guillermo Valle-Perez, Chico Q Camargo, and Ard A Louis. Deep learning generalizes because the parameter- function map is biased towards simple functions.arXiv preprint arXiv:1805.08522, 2018

  2. [2]

    Sgd on neural networks learns functions of increasing complexity.Advances in neural information processing systems, 32, 2019

    Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. Sgd on neural networks learns functions of increasing complexity.Advances in neural information processing systems, 32, 2019

  3. [3]

    Neural networks trained with sgd learn distributions of increasing complexity

    Maria Refinetti, Alessandro Ingrosso, and Sebastian Goldt. Neural networks trained with sgd learn distributions of increasing complexity. InInternational Conference on Machine Learning, pages 28843–28863. PMLR, 2023

  4. [4]

    A distributional simplicity bias in the learning dynamics of transformers.Advances in Neural Information Processing Systems, 37:96207–96228, 2024

    Riccardo Rende, Federica Gerace, Alessandro Laio, and Sebastian Goldt. A distributional simplicity bias in the learning dynamics of transformers.Advances in Neural Information Processing Systems, 37:96207–96228, 2024

  5. [5]

    Neural networks learn statistics of increasing complexity.arXiv preprint arXiv:2402.04362, 2024

    Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, and Xiaoli Fern. Neural networks learn statistics of increasing complexity.arXiv preprint arXiv:2402.04362, 2024

  6. [6]

    How transformers learn structured data: insights from hierarchical filtering.arXiv preprint arXiv:2408.15138, 2024

    Jérôme Garnier-Brun, Marc Mézard, Emanuele Moscato, and Luca Saglietti. How transformers learn structured data: insights from hierarchical filtering.arXiv preprint arXiv:2408.15138, 2024

  7. [7]

    Compression of structured data with autoencoders: Provable benefit of nonlinearities and depth.arXiv preprint arXiv:2402.05013, 2024

    Kevin Kögler, Alexander Shevchenko, Hamed Hassani, and Marco Mondelli. Compression of structured data with autoencoders: Provable benefit of nonlinearities and depth.arXiv preprint arXiv:2402.05013, 2024

  8. [8]

    Inferring effective couplings with restricted boltzmann machines.SciPost Physics, 16(4):095, 2024

    Aurélien Decelle, Cyril Furtlehner, Alfonso de Jesús Navas Gómez, and Beatriz Seoane. Inferring effective couplings with restricted boltzmann machines.SciPost Physics, 16(4):095, 2024

  9. [9]

    Inferring higher-order couplings with neural networks.Physical Review Letters, 135(20):207301, 2025

    Aurélien Decelle, Alfonso de Jesús Navas Gómez, and Beatriz Seoane. Inferring higher-order couplings with neural networks.Physical Review Letters, 135(20):207301, 2025

  10. [10]

    How compositional generalization and creativity improve as diffusion models are trained.arXiv preprint arXiv:2502.12089, 2025

    Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, and Matthieu Wyart. How composi- tional generalization and creativity improve as diffusion models are trained.arXiv preprint arXiv:2502.12089, 2025. 9 APREPRINT- MAY11, 2026

  11. [11]

    A theory of learning data statistics in diffusion models, from easy to hard.arXiv preprint arXiv:2603.12901, 2026

    Lorenzo Bardone, Claudia Merger, and Sebastian Goldt. A theory of learning data statistics in diffusion models, from easy to hard.arXiv preprint arXiv:2603.12901, 2026

  12. [12]

    Exact solution for on-line learning in multilayer neural networks.Physical Review Letters, 74(21):4337, 1995

    David Saad and Sara A Solla. Exact solution for on-line learning in multilayer neural networks.Physical Review Letters, 74(21):4337, 1995

  13. [13]

    A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018

    Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018

  14. [14]

    Learning protein constitutive motifs from sequence data

    Jérôme Tubiana, Simona Cocco, and Rémi Monasson. Learning protein constitutive motifs from sequence data. Elife, 8:e39397, 2019

  15. [15]

    Uncovering statistical structure in large-scale neural activity with restricted boltzmann machines

    Nicolas Béreux, Giovanni Catania, Aurélien Decelle, Francesca Mignacco, Alfonso de Jesús Navas Gómez, and Beatriz Seoane. Uncovering statistical structure in large-scale neural activity with restricted boltzmann machines. arXiv preprint arXiv:2603.11032, 2026

  16. [16]

    The loss surfaces of multilayer networks.Proceedings of AISTATS, 2015

    Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks.Proceedings of AISTATS, 2015

  17. [17]

    Escaping from saddle points—online stochastic gradient for tensor decomposition

    Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. InConference on learning theory, pages 797–842. PMLR, 2015

  18. [18]

    On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points.Journal of the ACM (JACM), 68(2):1–29, 2021

    Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points.Journal of the ACM (JACM), 68(2):1–29, 2021

  19. [19]

    Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018

  20. [20]

    Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

    Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. InConference on learning theory, pages 2388–2464. PMLR, 2019

  21. [21]

    Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion

    Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. InInternational conference on machine learning, pages 3345–3354. PMLR, 2018

  22. [22]

    Understanding deep learning requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization.arXiv preprint arXiv:1611.03530, 2016

  23. [23]

    Exact training of restricted boltzmann machines on intrinsically low dimensional data.Physical Review Letters, 127(15):158303, 2021

    Aurélien Decelle and Cyril Furtlehner. Exact training of restricted boltzmann machines on intrinsically low dimensional data.Physical Review Letters, 127(15):158303, 2021

  24. [24]

    On the anatomy of mcmc-based maximum likelihood learning of energy-based models.Proceedings of the AAAI Conference on Artificial Intelligence, 2020

    Erik Nijkamp, Michael Hill, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of mcmc-based maximum likelihood learning of energy-based models.Proceedings of the AAAI Conference on Artificial Intelligence, 2020

  25. [25]

    Exact training of restricted boltzmann machines on intrinsically low- dimensional data.Physical Review Letters, 127:158303, 2021

    Aurélien Decelle and Cyril Furtlehner. Exact training of restricted boltzmann machines on intrinsically low- dimensional data.Physical Review Letters, 127:158303, 2021

  26. [26]

    Explaining the effects of non- convergent MCMC in the training of energy-based models

    Elisabeth Agoritsas, Giovanni Catania, Aurélien Decelle, and Beatriz Seoane. Explaining the effects of non- convergent MCMC in the training of energy-based models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 2...

  27. [27]

    Representational power of restricted boltzmann machines and deep belief networks.Neural computation, 20(6):1631–1649, 2008

    Nicolas Le Roux and Yoshua Bengio. Representational power of restricted boltzmann machines and deep belief networks.Neural computation, 20(6):1631–1649, 2008

  28. [28]

    Refinements of universal approximation results for deep belief networks and restricted boltzmann machines.Neural computation, 23(5):1306–1319, 2011

    Guido Montufar and Nihat Ay. Refinements of universal approximation results for deep belief networks and restricted boltzmann machines.Neural computation, 23(5):1306–1319, 2011

  29. [29]

    Expressive power and approximation errors of restricted boltzmann machines.Advances in neural information processing systems, 24, 2011

    Guido F Montúfar, Johannes Rauh, and Nihat Ay. Expressive power and approximation errors of restricted boltzmann machines.Advances in neural information processing systems, 24, 2011

  30. [30]

    Backpropagation applied to handwritten zip code recognition.Neural computation, 1(4):541– 551, 1989

    Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition.Neural computation, 1(4):541– 551, 1989

  31. [31]

    Neuropixels visual coding (dataset) 2019

    Allen Institute MindScope Program Allen Brain Observatory. Neuropixels visual coding (dataset) 2019. https://brain-map.org/explore/circuits

  32. [32]

    Cambridge University Press, 2014

    Ryan O’Donnell.Analysis of Boolean Functions. Cambridge University Press, 2014

  33. [33]

    Mnist handwritten digit database

    Yann LeCun and Corinna Cortes. Mnist handwritten digit database. http://yann.lecun.com/exdb/mnist/, 1998. 10 APREPRINT- MAY11, 2026

  34. [34]

    Fast training and sampling of restricted boltzmann machines.arXiv preprint arXiv:2405.15376, 2024

    Nicolas Béreux, Aurélien Decelle, Cyril Furtlehner, Lorenzo Rosset, and Beatriz Seoane. Fast training and sampling of restricted boltzmann machines.arXiv preprint arXiv:2405.15376, 2024

  35. [35]

    Training energy-based models with parallel trajectory tempering

    Nicolas Béreux, Aurélien Decelle, Cyril Furtlehner, and Beatriz Seoane. Training energy-based models with parallel trajectory tempering. InEurIPS 2025 Workshop on Principles of Generative Modeling (PriGM). A Elements of Pseudo-Boolean function analysis This appendix collects the theorems of pseudo-Boolean function analysis that underpin the results presen...