Can Microcanonical Langevin Dynamics Leverage Mini-Batch Gradient Noise?

David R\"ugamer; Emanuel Sommer; Jakob Robnik; Kangning Diao; Uros Seljak

arxiv: 2602.06500 · v2 · pith:V4LBBSO4new · submitted 2026-02-06 · 💻 cs.LG

Can Microcanonical Langevin Dynamics Leverage Mini-Batch Gradient Noise?

Emanuel Sommer , Kangning Diao , Jakob Robnik , Uros Seljak , David R\"ugamer This is my paper

Pith reviewed 2026-05-21 13:30 UTC · model grok-4.3

classification 💻 cs.LG

keywords microcanonical Langevin dynamicsmini-batch gradientsstochastic gradient MCMCBayesian neural networksscalable Bayesian inferenceadaptive step size tuning

0 comments

The pith

Microcanonical Langevin dynamics can leverage mini-batch gradient noise with preconditioning and adaptive tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether microcanonical Langevin Monte Carlo can scale to large datasets by replacing full gradients with cheaper mini-batch versions. A continuous-time analysis reveals that anisotropic mini-batch noise creates a systematic bias and triggers numerical instabilities in high-dimensional posteriors. The authors introduce a gradient noise preconditioning scheme to shrink that bias together with an energy-variance adaptive tuner that selects step sizes and enforces stability. If these corrections hold, the resulting sampler becomes practical for large-scale Bayesian inference without the cost of full-dataset gradients.

Core claim

Stochastic-gradient microcanonical dynamics exhibit bias due to anisotropic gradient noise and numerical instabilities in complex high-dimensional posteriors; a principled gradient noise preconditioning scheme reduces this bias while an energy-variance-based adaptive tuner automates step size selection and supplies numerical guardrails, producing a robust scalable microcanonical Monte Carlo sampler that reaches state-of-the-art performance on tasks such as Bayesian neural networks.

What carries the argument

Gradient noise preconditioning scheme combined with energy-variance-based adaptive tuner for microcanonical dynamics.

If this is right

Microcanonical Monte Carlo sampling becomes feasible for large models using only mini-batch gradients.
The sampler achieves state-of-the-art results on challenging high-dimensional tasks such as Bayesian neural networks.
Combined with ensemble techniques the method yields a new class of stochastic microcanonical Langevin ensemble samplers for large-scale Bayesian inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same preconditioning approach may reduce bias in other stochastic-gradient MCMC algorithms.
The energy-variance tuner could be tested on sampling problems outside neural network posteriors to check generality.
Similar variance-based adaptation might improve stability in related mini-batch optimization settings.

Load-bearing premise

The gradient noise preconditioning scheme and energy-variance-based adaptive tuner sufficiently mitigate bias and numerical instabilities when applied to complex high-dimensional posteriors.

What would settle it

Running the sampler on a high-dimensional Bayesian neural network posterior and finding either persistent bias relative to the full-gradient version or frequent numerical instabilities would show that the fixes do not work as claimed.

read the original abstract

Scaling inference methods such as Markov chain Monte Carlo to high-dimensional models remains a central challenge in Bayesian deep learning. A promising recent proposal, microcanonical Langevin Monte Carlo, has shown state-of-the-art performance across a wide range of problems. However, its reliance on full-dataset gradients makes it prohibitively expensive for large-scale problems. This paper addresses a fundamental question: Can microcanonical dynamics effectively leverage mini-batch gradient noise? We provide the first systematic study of this problem, establishing a novel continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics. We reveal two critical failure modes: a theoretically derived bias due to anisotropic gradient noise and numerical instabilities in complex high-dimensional posteriors. To tackle these issues, we propose a principled gradient noise preconditioning scheme shown to significantly reduce this bias and develop a novel, energy-variance-based adaptive tuner that automates step size selection and dynamically informs numerical guardrails. The resulting algorithm is a robust and scalable microcanonical Monte Carlo sampler that achieves state-of-the-art performance on challenging high-dimensional inference tasks like Bayesian neural networks. Combined with recent ensemble techniques, our work unlocks a new class of stochastic microcanonical Langevin ensemble (SMILE) samplers for large-scale Bayesian inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Continuous-time analysis of mini-batch microcanonical Langevin is new, but discrete-time bias and high-dim stability need more work.

read the letter

The punchline is that this work gives the first continuous-time treatment of stochastic-gradient microcanonical Langevin dynamics and proposes concrete fixes for the bias and instability problems that come with mini-batches. They start by writing down the continuous-time SDE with noisy gradients and derive the extra drift term caused by the anisotropy in the noise. That part looks clean. Then they introduce a preconditioning matrix that depends on the gradient noise covariance to cancel the bias. On top of that they add an adaptive mechanism that watches the variance of the energy and adjusts the step size on the fly while adding guardrails against blow-ups. The abstract says this leads to a sampler that works on Bayesian neural nets at scale, and they mention combining it with ensembles to get SMILE samplers. The analysis is new relative to earlier microcanonical papers, and the preconditioner is a reasonable idea that directly targets the derived bias. If the experiments show clear gains over standard SGLD or other mini-batch MCMC on large models, that would be solid. The main concern is that everything is derived in continuous time. The actual code runs a discrete integrator, and the local errors from that discretization can interact with the remaining noise in ways the continuous derivation does not control. There is no explicit bound showing that the preconditioner restores the correct invariant at finite step sizes, nor a stability argument for the non-convex, high-curvature regions typical in neural network posteriors. The adaptive tuner helps with step-size choice, but it is not obvious it prevents divergence when the landscape has many saddle points or flat directions. This paper is for people who care about practical Bayesian inference in deep learning. Someone already working on stochastic gradient MCMC will get value from the bias derivation and the new tuner. It is worth sending to peer review because the question is timely and the proposals are specific enough that referees can check the math and the experiments directly. Recommendation: Put it through review, but ask the authors for more on the discrete-time error and high-dimensional stability.

Referee Report

2 major / 2 minor

Summary. The paper asks whether microcanonical Langevin dynamics can be made to work with mini-batch gradients. It supplies a continuous-time analysis that identifies an anisotropic bias induced by stochastic gradients and numerical instabilities in high dimensions, then introduces a gradient-noise preconditioner claimed to reduce the bias together with an energy-variance adaptive tuner for step-size selection. The resulting algorithm is asserted to be a robust, scalable sampler that reaches state-of-the-art performance on Bayesian neural networks and, when combined with ensembles, yields a new class of SMILE samplers.

Significance. If the preconditioner and tuner provably restore the correct invariant measure at practical step sizes and remain stable on non-convex high-dimensional posteriors, the work would materially advance scalable MCMC for Bayesian deep learning. The continuous-time derivation and the explicit identification of failure modes are useful contributions even if the discrete-time guarantees require further work.

major comments (2)

[§3 and Algorithm 1] The continuous-time bias derivation (abstract and §3) is performed in the infinitesimal-step limit. The implemented sampler uses a discrete Euler–Maruyama scheme whose local truncation error interacts with the anisotropic noise; no explicit bound on the resulting invariant-measure error or Lyapunov argument is supplied to show that the preconditioner restores the correct stationary distribution at finite step sizes.
[§4.2 and §5] The energy-variance tuner and numerical guardrails are motivated by the observed instabilities, yet the manuscript provides no high-dimensional Lyapunov or moment-control analysis demonstrating that the adaptive scheme prevents blow-up on the non-convex, high-curvature landscapes typical of Bayesian neural networks.

minor comments (2)

[Algorithm 1] Notation for the preconditioner matrix and the energy-variance statistic should be introduced with explicit definitions before their first use in the algorithm box.
[§6] The experimental section would benefit from an ablation that isolates the contribution of the preconditioner versus the adaptive tuner on the reported BNN tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment below, clarifying the scope of our contributions while acknowledging limitations where appropriate.

read point-by-point responses

Referee: [§3 and Algorithm 1] The continuous-time bias derivation (abstract and §3) is performed in the infinitesimal-step limit. The implemented sampler uses a discrete Euler–Maruyama scheme whose local truncation error interacts with the anisotropic noise; no explicit bound on the resulting invariant-measure error or Lyapunov argument is supplied to show that the preconditioner restores the correct stationary distribution at finite step sizes.

Authors: We agree that the bias analysis in Section 3 is derived under the continuous-time, infinitesimal step-size limit. This framework enables an explicit identification of the anisotropic bias arising from mini-batch gradient noise and motivates the design of the preconditioner. The implemented algorithm employs a discrete Euler–Maruyama discretization, and we do not supply a rigorous bound on the invariant-measure discrepancy or a Lyapunov argument for finite step sizes. Our defense rests on the empirical evidence in Sections 4 and 5, where the preconditioned sampler exhibits reduced bias and stable performance on high-dimensional Bayesian neural network posteriors. We will revise the manuscript to include an explicit discussion of the continuous-time approximation and the reliance on empirical validation for practical step sizes. revision: partial
Referee: [§4.2 and §5] The energy-variance tuner and numerical guardrails are motivated by the observed instabilities, yet the manuscript provides no high-dimensional Lyapunov or moment-control analysis demonstrating that the adaptive scheme prevents blow-up on the non-convex, high-curvature landscapes typical of Bayesian neural networks.

Authors: The energy-variance tuner and guardrails in Section 4.2 are developed from the instabilities observed when applying microcanonical dynamics to mini-batch gradients in high dimensions. While the manuscript does not contain a high-dimensional Lyapunov or moment-control analysis for non-convex landscapes, the adaptive scheme is shown through the experiments in Section 5 to maintain numerical stability across the tested Bayesian neural network models. We acknowledge that a theoretical stability guarantee would strengthen the claims and will add a paragraph in the discussion section noting this limitation and outlining it as an avenue for future work. revision: partial

Circularity Check

0 steps flagged

No significant circularity: derivation relies on new continuous-time analysis and proposals independent of inputs

full rationale

The paper establishes a novel continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics, derives specific failure modes (anisotropic bias and instabilities), and introduces a preconditioning scheme plus energy-variance tuner to address them. These steps are presented as independent contributions rather than reductions to prior fits, self-definitions, or self-citation chains. No equations or claims in the provided text reduce a prediction or result to an input by construction; the central sampler proposal follows from the new analysis and is evaluated on external tasks like Bayesian neural networks. This is the common honest outcome for papers with fresh theoretical derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level introduction of SMILE samplers.

invented entities (1)

SMILE samplers no independent evidence
purpose: stochastic microcanonical Langevin ensemble samplers for large-scale Bayesian inference
New class of samplers formed by combining the proposed method with recent ensemble techniques.

pith-pipeline@v0.9.0 · 5754 in / 1045 out tokens · 42463 ms · 2026-05-21T13:30:02.318467+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We establish a rigorous theoretical foundation for microcanonical dynamics under stochasticity; specifically, we formally derive the systematic bias induced by anisotropic gradient noise and mathematically prove that a principled preconditioning scheme eliminates the resulting noise-induced drift.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean alpha_pin_under_high_calibration unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we model the distribution of |ΔE| in an online fashion using a Gamma distribution... adapt the step size multiplicatively based on where |ΔE| lies relative to the Gamma quantiles

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.