pith. machine review for the scientific record.

arxiv: 2605.07844 · v1 · submitted 2026-05-08 · 💻 cs.LG

Recognition: 2 Lean theorem links

Distributional simplicity bias and effective convexity in Energy Based Models

Alfonso de Jesús Navas Gómez, Aurélien Decelle, Beatriz Seoane

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:20 UTC · model grok-4.3

classification 💻 cs.LG
keywords energy-based models · gradient flow · fixed points · distributional simplicity bias · Ising models · Fourier expansion · binary distributions · learning dynamics

The pith

Gradient flow in energy-based models on binary variables admits data-consistent fixed points and learns lower-order interactions before higher-order ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the training dynamics of energy-based models as a dynamical system on the space of distributions over binary variables. It establishes that when the model is expressive enough, gradient descent admits fixed points that either exactly match the target distribution or are spurious stationary states that fail to match it. Perturbation directions around the matching fixed points are either stable or neutral, with neutral directions leaving the effective model unchanged. The dynamics further impose a strict order on what gets learned: low-order patterns are captured first, higher-order ones only later.

Core claim

Under sufficient expressivity, the gradient flow induced by learning strictly positive distributions over binary variables admits two types of fixed points: data-consistent points, which exactly reproduce the target distribution, and spurious points, which satisfy stationarity without matching the target distribution. Around data-consistent points, perturbations are either stable or neutral, with neutral directions leaving the effective model invariant. Gradient dynamics induce a hierarchy in which lower-order interactions are learned before higher-order ones.

What carries the argument

The effective model, viewed either as a generalised Ising model with higher-order interactions or as the Fourier expansion of the energy function; it encodes the state of the gradient flow and exposes the interaction-order hierarchy.
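For concreteness, the mapping in question is the standard parity (Walsh-Fourier) expansion of a pseudo-Boolean function, in the sense of O'Donnell (reference 32 below); the display is a generic rendering of that expansion, not an equation copied from the paper:

```latex
% Every energy E on s in {-1,+1}^n expands uniquely in the parity basis:
\[
  E(s) \;=\; \sum_{S \subseteq [n]} \hat{E}(S)\,\chi_S(s),
  \qquad
  \chi_S(s) \;=\; \prod_{i \in S} s_i,
  \qquad
  \hat{E}(S) \;=\; \frac{1}{2^n}\sum_{s \in \{-1,+1\}^n} E(s)\,\chi_S(s).
\]
```

Identifying -Ê(S) with a coupling J_S on the spin subset S gives the generalised Ising form, and |S| is the interaction order along which the claimed hierarchy is indexed.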

If this is right

  • Data-consistent fixed points act as stable or neutrally stable attractors under small perturbations.
  • Spurious fixed points satisfy stationarity yet are avoided in practice because the dynamics reach data consistency at low orders first.
  • The learning process exhibits a strict ordering of interactions from low to high degree.
  • Neutral perturbation directions around data-consistent points leave the effective model unchanged, preserving the learned distribution.
  • This ordering supplies a direct mechanistic reason why fixed points that are not data-consistent at low orders are not observed in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same ordering mechanism could be tested by monitoring the spectrum of interaction coefficients during training on synthetic Ising-like data with known interaction orders (a sketch follows this list).
  • If the effective-model representation extends to continuous or high-dimensional settings, similar simplicity biases might appear in other gradient-trained generative models.
  • Practitioners might accelerate convergence by warm-starting with models restricted to low-order terms.
  • The neutral directions suggest that certain parameter redundancies could be exploited for more efficient parameterisation without changing the learned distribution.
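A minimal sketch of the monitoring experiment suggested in the first bullet, not code from the paper: train a fully parameterised EBM on n = 5 spins with exact gradients, where the target is a random pairwise (order-2) Ising model, and track the norm of the coefficients at each interaction order. All sizes and the learning rate are illustrative choices.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 5
states = np.array(list(itertools.product([-1, 1], repeat=n)))  # all 2^n spin configurations

# Parity features chi_S(s) for every non-empty subset S, grouped by order |S|.
subsets = [S for k in range(1, n + 1) for S in itertools.combinations(range(n), k)]
orders = np.array([len(S) for S in subsets])
chi = np.array([states[:, S].prod(axis=1) for S in subsets]).T  # (2^n, #subsets)

def boltzmann(J):
    """p(s) proportional to exp(sum_S J_S chi_S(s)), i.e. energy E = -sum_S J_S chi_S."""
    logits = chi @ J
    p = np.exp(logits - logits.max())
    return p / p.sum()

# Target has couplings only at orders 1 and 2.
J_target = np.where(orders <= 2, 0.5 * rng.standard_normal(len(subsets)), 0.0)
data_moments = boltzmann(J_target) @ chi  # <chi_S> under the data distribution

# Exact gradient ascent on the log-likelihood: dJ_S = <chi_S>_data - <chi_S>_model.
J, lr, traj = np.zeros(len(subsets)), 0.5, []
for t in range(400):
    J += lr * (data_moments - boltzmann(J) @ chi)
    traj.append(J.copy())
    if t % 100 == 0:
        print(t, [round(np.linalg.norm(J[orders == k]), 3) for k in range(1, n + 1)])
traj = np.array(traj)  # (steps, #subsets), reused by the probe further below
```

Under the paper's claim, the printed order-1 and order-2 norms should move first while orders 3 to 5 stay near zero; any early growth at high order would already be informative.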

Load-bearing premise

The model class must be expressive enough to follow the full gradient flow without hidden approximations, and targets must be strictly positive distributions over binary variables.
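Why strict positivity carries weight (a one-line observation, not quoted from the paper): the target energy must be finite everywhere for a data-consistent point to exist inside the model class.

```latex
\[
  p(s) > 0 \;\;\forall\, s \in \{-1,+1\}^n
  \quad\Longrightarrow\quad
  E^{\star}(s) \;=\; -\log p(s) \;\text{ is finite for all } s,
\]
```

so E* itself admits the Fourier expansion above, and exact matching is attainable rather than only approachable in a limit.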

What would settle it

Training an energy-based model on a binary target distribution and finding that a higher-order interaction term reaches its target value before a lower-order term does would falsify the claimed learning hierarchy.
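One way to operationalise this test, building on the trajectory `traj` recorded in the training sketch above (the tolerance and the exact notion of "reaches its target" are choices of this sketch, not the paper's):

```python
# First-hit probe: step at which each coefficient first comes within tol of
# its target, then check whether any higher-order coefficient converges
# before some lower-order one. Coefficients whose target is zero would hit
# trivially at step 0, so they are masked out.
def first_hit_steps(traj, J_target, tol=1e-2):
    hit = np.abs(traj - J_target) < tol            # (steps, #coefficients)
    return np.where(hit.any(axis=0), hit.argmax(axis=0), np.inf)

def hierarchy_violated(traj, J_target, orders, tol=1e-2):
    steps = first_hit_steps(traj, J_target, tol)
    active = np.abs(J_target) > tol                # ignore trivially-hit zeros
    for k in range(2, int(orders.max()) + 1):
        hi = steps[(orders == k) & active]
        lo = steps[(orders == k - 1) & active]
        if hi.size and lo.size and hi.min() < lo.max():
            return True                            # some high order beat a low one
    return False

print("hierarchy violated:", hierarchy_violated(traj, J_target, orders))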

Figures

Figures reproduced from arXiv: 2605.07844 by Alfonso de Jesús Navas Gómez, Aurélien Decelle, Beatriz Seoane.

Figure 1. Frobenius norm of the effective parameters, extracted from three distinct RBM trainings using the mapping of … (image not reproduced; see source)

Figure 2. (Left) Covariance matrix of the data generated with the model in Eq. … (image not reproduced; see source)
Original abstract

Energy-based learning is a powerful framework for generative modelling, but its training is inherently non-convex, leading potentially to sensitivity to initialisation, poor local optima, and unstable gradient dynamics. We present a dynamical analysis of energy-based learning through the lens of the effective model, which can be interpreted as either a generalised Ising model with higher-order interactions or the Fourier expansion of the energy. Under sufficient expressivity, we show that the gradient flow induced by learning strictly positive distributions over binary variables admits two types of fixed points: data-consistent points, which exactly reproduce the target distribution, and spurious points, which satisfy stationarity without matching the target distribution. Around data-consistent points, we show that perturbations are either stable or neutral, with neutral directions leaving the effective model invariant. Finally, we show that gradient dynamics induce a hierarchy in which lower-order interactions are learned before higher-order ones. This provides a mechanistic explanation for the distributional simplicity bias and clarifies why fixed points that are not data-consistent at low orders are not observed in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that energy-based learning for strictly positive binary distributions can be analyzed via an 'effective model' (generalized Ising model or Fourier expansion of the energy). Under sufficient expressivity, gradient flow on this representation admits two classes of fixed points—data-consistent points that exactly match the target distribution and spurious points that satisfy stationarity without matching it—along with stability/neutrality properties around the data-consistent points and a dynamical hierarchy in which lower-order interactions are learned before higher-order ones. This is presented as a mechanistic account of the distributional simplicity bias in EBMs.

Significance. If the central claims hold, the work supplies a concrete dynamical explanation for why EBM training exhibits simplicity bias and why certain non-data-consistent fixed points are not observed in practice. The effective-model reduction is a useful interpretive device that could inform initialization, regularization, or architecture choices. The paper receives credit for attempting a parameter-free derivation of the hierarchy and fixed-point classification directly from the gradient flow, though the strength of these contributions depends on verifying the reduction from neural-network parameters.

major comments (2)
  1. [analysis of gradient flow and fixed points in the effective model] The fixed-point classification and stability analysis (data-consistent vs. spurious points, neutral directions) are derived by treating the effective Fourier coefficients as the direct variables of gradient flow. When the energy is realized by a neural network, the actual parameter update composes this flow with the Jacobian of the map from network weights to Fourier coefficients; because this Jacobian is neither constant nor guaranteed full-rank, the induced dynamics on the effective coefficients can differ, potentially creating or destroying the claimed spurious fixed points and altering the timescale separation.
  2. [derivation of the learning hierarchy] The hierarchy result (lower-order interactions learned before higher-order ones) is obtained from the eigenvalue structure or timescale separation in the effective-model flow. The same Jacobian issue applies: even under the sufficient-expressivity assumption, the composition with a non-constant Jacobian can change the ordering of learning timescales or introduce additional neutral directions not visible in the direct effective-model analysis.
minor comments (2)
  1. [assumptions and setup] The precise definition of 'sufficient expressivity' and the conditions under which the effective model faithfully captures the NN-induced flow should be stated as an explicit assumption with a supporting lemma or reference (a numerical probe of the weights-to-coefficients Jacobian is sketched after these comments).
  2. [effective model definition] Notation for the Fourier coefficients and the effective energy could be made more explicit (e.g., an equation showing the exact mapping from the energy function E to its Fourier representation) to aid readability.
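To make the Jacobian concern of major comment 1 concrete, here is a numerical probe under illustrative assumptions (a tiny RBM, finite-difference derivatives; none of the specific choices come from the paper): map the RBM weights to the Fourier coefficients of the marginal energy over the visible spins and check the rank of the Jacobian of that map.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
nv, nh = 4, 3
V = np.array(list(itertools.product([-1, 1], repeat=nv)))      # (16, nv) visible states
subsets = [S for k in range(1, nv + 1) for S in itertools.combinations(range(nv), k)]
chi = np.array([V[:, S].prod(axis=1) for S in subsets]).T      # (16, 15) parity features

def fourier_coeffs(theta):
    """theta = flattened (W, a, b); returns Fourier coeffs of the marginal energy."""
    W = theta[: nv * nh].reshape(nv, nh)
    a, b = theta[nv * nh : nv * nh + nv], theta[nv * nh + nv :]
    # Marginal free energy of the RBM over visible spins (up to an additive constant):
    E_eff = -V @ a - np.log1p(np.exp(V @ W + b)).sum(axis=1)
    return chi.T @ E_eff / len(V)   # hat{E}(S) = 2^{-nv} sum_v E(v) chi_S(v)

theta0 = 0.1 * rng.standard_normal(nv * nh + nv + nh)
eps = 1e-6
# Central finite differences, one parameter direction at a time.
J = np.array([
    (fourier_coeffs(theta0 + eps * e) - fourier_coeffs(theta0 - eps * e)) / (2 * eps)
    for e in np.eye(len(theta0))
]).T                                 # (15 coefficients, 19 parameters)
print("Jacobian shape:", J.shape, "rank:", np.linalg.matrix_rank(J, tol=1e-6))
```

Full rank at a point means the parameter dynamics can realise any infinitesimal motion of the effective coefficients there; rank deficiency is exactly the situation the referee warns about.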

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful reading and insightful comments on the connection between the effective-model analysis and neural-network realizations. We address the two major comments point by point below, providing a substantive response while noting the revisions we will make.

Point-by-point responses
  1. Referee: [analysis of gradient flow and fixed points in the effective model] The fixed-point classification and stability analysis (data-consistent vs. spurious points, neutral directions) are derived by treating the effective Fourier coefficients as the direct variables of gradient flow. When the energy is realized by a neural network, the actual parameter update composes this flow with the Jacobian of the map from network weights to Fourier coefficients; because this Jacobian is neither constant nor guaranteed full-rank, the induced dynamics on the effective coefficients can differ, potentially creating or destroying the claimed spurious fixed points and altering the timescale separation.

    Authors: We agree that the parameterization introduces a Jacobian, but the fixed-point correspondence is preserved in one direction. Let phi(w) denote the map from network parameters w to effective Fourier coefficients. The loss depends on the distribution, hence on phi(w), so the parameter gradient is J^T times the effective gradient, where J = d phi / dw (the composed dynamics are written out after these responses). Consequently, every point where the effective gradient vanishes remains a fixed point in parameter space; both the data-consistent and spurious fixed points of the effective model are therefore fixed points of the actual dynamics. The Jacobian can, however, introduce additional fixed points lying in the kernel of J^T and can modify local stability and neutral directions. Under the paper's sufficient-expressivity assumption we expect J to be locally full rank around the fixed points of interest, which would prevent extra fixed points and preserve the stability classification. We will revise the manuscript to state this explicitly, add a short derivation of the composed gradient, and note the conditions under which the effective-model results carry over exactly. This is a partial revision; a complete spectral analysis for arbitrary architectures lies beyond the present scope. [revision: partial]

  2. Referee: [derivation of the learning hierarchy] The hierarchy result (lower-order interactions learned before higher-order ones) is obtained from the eigenvalue structure or timescale separation in the effective-model flow. The same Jacobian issue applies: even under the sufficient-expressivity assumption, the composition with a non-constant Jacobian can change the ordering of learning timescales or introduce additional neutral directions not visible in the direct effective-model analysis.

    Authors: The induced flow on the effective coefficients is a Riemannian gradient flow whose metric is given by J J^T. Fixed points remain the same, but the local linearization and therefore the convergence rates are altered by this metric. If J J^T is approximately diagonal (or block-diagonal) with respect to interaction order, the timescale separation and the lower-to-higher learning order are preserved; otherwise the ordering could be modified. The sufficient-expressivity assumption does not automatically guarantee this diagonal structure, so the referee's concern is valid. We will add a paragraph explaining the composed dynamics, stating the additional assumption needed for the hierarchy to be invariant, and observing that the empirical simplicity bias reported in the literature is consistent with the effective-model prediction. This will be a partial revision. [revision: partial]
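Written out, the composed dynamics the two responses describe are elementary chain-rule algebra in the authors' notation (phi for the effective coefficients, L for the loss); this is a rendering of their argument, not a new result:

```latex
\[
  \dot{w} \;=\; -\,\nabla_{w}\mathcal{L}
          \;=\; -\,J^{\top}\nabla_{\phi}\mathcal{L},
  \qquad
  \dot{\phi} \;=\; J\,\dot{w}
             \;=\; -\,J J^{\top}\,\nabla_{\phi}\mathcal{L},
  \qquad
  J \;=\; \frac{\partial \phi}{\partial w}.
\]
```

So every zero of the effective gradient is a fixed point in parameter space, new fixed points require the effective gradient to lie in ker(J^T), and the induced flow on phi is a gradient flow in the metric set by J J^T; the hierarchy survives exactly when that metric is (block-)diagonal with respect to interaction order.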

standing simulated objections not resolved
  • A general proof that the Jacobian J J^T preserves the exact ordering of learning timescales for arbitrary neural-network architectures realizing the energy function.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper derives fixed-point classes and the lower-to-higher order learning hierarchy directly from the gradient-flow equations written in the effective Fourier / generalized Ising coordinates. These coordinates are introduced as an exact reparametrization of the energy (under the maintained assumption of sufficient expressivity and strictly positive target distributions), not as a fitted model whose parameters are later renamed as predictions. No load-bearing step reduces to a self-citation chain, an ansatz smuggled via prior work, or a definitional identity; the stability/neutrality statements and timescale separation follow from the algebraic structure of the flow itself. The Jacobian concern raised by the skeptic is a question of approximation validity, not a circularity in the presented chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed from abstract alone; no explicit free parameters, axioms, or invented entities can be extracted. The effective model is presented as an interpretive device rather than a newly postulated entity with independent evidence.

pith-pipeline@v0.9.0 · 5484 in / 1307 out tokens · 50669 ms · 2026-05-11T02:20:42.549354+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 1 internal anchor

  1. [1]

    and Louis, Ard A

    Guillermo Valle-Perez, Chico Q Camargo, and Ard A Louis. Deep learning generalizes because the parameter- function map is biased towards simple functions.arXiv preprint arXiv:1805.08522, 2018

  2. [2]

    Sgd on neural networks learns functions of increasing complexity.Advances in neural information processing systems, 32, 2019

    Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. Sgd on neural networks learns functions of increasing complexity.Advances in neural information processing systems, 32, 2019

  3. [3]

    Neural networks trained with sgd learn distributions of increasing complexity

    Maria Refinetti, Alessandro Ingrosso, and Sebastian Goldt. Neural networks trained with sgd learn distributions of increasing complexity. InInternational Conference on Machine Learning, pages 28843–28863. PMLR, 2023

  4. [4]

    A distributional simplicity bias in the learning dynamics of transformers.Advances in Neural Information Processing Systems, 37:96207–96228, 2024

    Riccardo Rende, Federica Gerace, Alessandro Laio, and Sebastian Goldt. A distributional simplicity bias in the learning dynamics of transformers.Advances in Neural Information Processing Systems, 37:96207–96228, 2024

  5. [5]

    Neural networks learn statistics of increasing complexity.arXiv preprint arXiv:2402.04362, 2024

    Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, and Xiaoli Fern. Neural networks learn statistics of increasing complexity.arXiv preprint arXiv:2402.04362, 2024

  6. [6]

    How transformers learn structured data: insights from hierarchical filtering.arXiv preprint arXiv:2408.15138, 2024

    Jérôme Garnier-Brun, Marc Mézard, Emanuele Moscato, and Luca Saglietti. How transformers learn structured data: insights from hierarchical filtering.arXiv preprint arXiv:2408.15138, 2024

  7. [7]

    Compression of structured data with autoencoders: Provable benefit of nonlinearities and depth.arXiv preprint arXiv:2402.05013, 2024

    Kevin Kögler, Alexander Shevchenko, Hamed Hassani, and Marco Mondelli. Compression of structured data with autoencoders: Provable benefit of nonlinearities and depth.arXiv preprint arXiv:2402.05013, 2024

  8. [8]

    Inferring effective couplings with restricted boltzmann machines.SciPost Physics, 16(4):095, 2024

    Aurélien Decelle, Cyril Furtlehner, Alfonso de Jesús Navas Gómez, and Beatriz Seoane. Inferring effective couplings with restricted boltzmann machines.SciPost Physics, 16(4):095, 2024

  9. [9]

    Inferring higher-order couplings with neural networks.Physical Review Letters, 135(20):207301, 2025

    Aurélien Decelle, Alfonso de Jesús Navas Gómez, and Beatriz Seoane. Inferring higher-order couplings with neural networks.Physical Review Letters, 135(20):207301, 2025

  10. [10]

    How compositional generalization and creativity improve as diffusion models are trained.arXiv preprint arXiv:2502.12089, 2025

    Alessandro Favero, Antonio Sclocchi, Francesco Cagnetta, Pascal Frossard, and Matthieu Wyart. How composi- tional generalization and creativity improve as diffusion models are trained.arXiv preprint arXiv:2502.12089, 2025. 9 APREPRINT- MAY11, 2026

  11. [11]

    A theory of learning data statistics in diffusion models, from easy to hard.arXiv preprint arXiv:2603.12901, 2026

    Lorenzo Bardone, Claudia Merger, and Sebastian Goldt. A theory of learning data statistics in diffusion models, from easy to hard.arXiv preprint arXiv:2603.12901, 2026

  12. [12]

    Exact solution for on-line learning in multilayer neural networks.Physical Review Letters, 74(21):4337, 1995

    David Saad and Sara A Solla. Exact solution for on-line learning in multilayer neural networks.Physical Review Letters, 74(21):4337, 1995

  13. [13]

    A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018

    Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018

  14. [14]

    Learning protein constitutive motifs from sequence data

    Jérôme Tubiana, Simona Cocco, and Rémi Monasson. Learning protein constitutive motifs from sequence data. Elife, 8:e39397, 2019

  15. [15]

    Uncovering statistical structure in large-scale neural activity with restricted boltzmann machines

    Nicolas Béreux, Giovanni Catania, Aurélien Decelle, Francesca Mignacco, Alfonso de Jesús Navas Gómez, and Beatriz Seoane. Uncovering statistical structure in large-scale neural activity with restricted boltzmann machines. arXiv preprint arXiv:2603.11032, 2026

  16. [16]

    The loss surfaces of multilayer networks.Proceedings of AISTATS, 2015

    Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks.Proceedings of AISTATS, 2015

  17. [17]

    Escaping from saddle points—online stochastic gradient for tensor decomposition

    Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. InConference on learning theory, pages 797–842. PMLR, 2015

  18. [18]

    On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points.Journal of the ACM (JACM), 68(2):1–29, 2021

    Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points.Journal of the ACM (JACM), 68(2):1–29, 2021

  19. [19]

    Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018

  20. [20]

    Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit

    Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. InConference on learning theory, pages 2388–2464. PMLR, 2019

  21. [21]

    Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion

    Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen. Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. InInternational conference on machine learning, pages 3345–3354. PMLR, 2018

  22. [22]

    Understanding deep learning requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization.arXiv preprint arXiv:1611.03530, 2016

  23. [23]

    Exact training of restricted boltzmann machines on intrinsically low dimensional data.Physical Review Letters, 127(15):158303, 2021

    Aurélien Decelle and Cyril Furtlehner. Exact training of restricted boltzmann machines on intrinsically low dimensional data.Physical Review Letters, 127(15):158303, 2021

  24. [24]

    On the anatomy of mcmc-based maximum likelihood learning of energy-based models.Proceedings of the AAAI Conference on Artificial Intelligence, 2020

    Erik Nijkamp, Michael Hill, Song-Chun Zhu, and Ying Nian Wu. On the anatomy of mcmc-based maximum likelihood learning of energy-based models.Proceedings of the AAAI Conference on Artificial Intelligence, 2020

  25. [25]

    Exact training of restricted boltzmann machines on intrinsically low- dimensional data.Physical Review Letters, 127:158303, 2021

    Aurélien Decelle and Cyril Furtlehner. Exact training of restricted boltzmann machines on intrinsically low- dimensional data.Physical Review Letters, 127:158303, 2021

  26. [26]

    Explaining the effects of non- convergent MCMC in the training of energy-based models

    Elisabeth Agoritsas, Giovanni Catania, Aurélien Decelle, and Beatriz Seoane. Explaining the effects of non- convergent MCMC in the training of energy-based models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 2...

  27. [27]

    Representational power of restricted boltzmann machines and deep belief networks.Neural computation, 20(6):1631–1649, 2008

    Nicolas Le Roux and Yoshua Bengio. Representational power of restricted boltzmann machines and deep belief networks.Neural computation, 20(6):1631–1649, 2008

  28. [28]

    Refinements of universal approximation results for deep belief networks and restricted boltzmann machines.Neural computation, 23(5):1306–1319, 2011

    Guido Montufar and Nihat Ay. Refinements of universal approximation results for deep belief networks and restricted boltzmann machines.Neural computation, 23(5):1306–1319, 2011

  29. [29]

    Expressive power and approximation errors of restricted boltzmann machines.Advances in neural information processing systems, 24, 2011

    Guido F Montúfar, Johannes Rauh, and Nihat Ay. Expressive power and approximation errors of restricted boltzmann machines.Advances in neural information processing systems, 24, 2011

  30. [30]

    Backpropagation applied to handwritten zip code recognition.Neural computation, 1(4):541– 551, 1989

    Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition.Neural computation, 1(4):541– 551, 1989

  31. [31]

    Neuropixels visual coding (dataset) 2019

    Allen Institute MindScope Program Allen Brain Observatory. Neuropixels visual coding (dataset) 2019. https://brain-map.org/explore/circuits

  32. [32]

    Cambridge University Press, 2014

    Ryan O’Donnell.Analysis of Boolean Functions. Cambridge University Press, 2014

  33. [33]

    Mnist handwritten digit database

    Yann LeCun and Corinna Cortes. Mnist handwritten digit database. http://yann.lecun.com/exdb/mnist/, 1998. 10 APREPRINT- MAY11, 2026

  34. [34]

    Fast training and sampling of restricted boltzmann machines.arXiv preprint arXiv:2405.15376, 2024

    Nicolas Béreux, Aurélien Decelle, Cyril Furtlehner, Lorenzo Rosset, and Beatriz Seoane. Fast training and sampling of restricted boltzmann machines.arXiv preprint arXiv:2405.15376, 2024

  35. [35]

    Training energy-based models with parallel trajectory tempering

    Nicolas Béreux, Aurélien Decelle, Cyril Furtlehner, and Beatriz Seoane. Training energy-based models with parallel trajectory tempering. InEurIPS 2025 Workshop on Principles of Generative Modeling (PriGM). A Elements of Pseudo-Boolean function analysis This appendix collects the theorems of pseudo-Boolean function analysis that underpin the results presen...