Subspace Inference for Bayesian Deep Learning

Andrew Gordon Wilson; Dmitry Vetrov; Pavel Izmailov; Polina Kirichenko; Timur Garipov; Wesley J. Maddox

arxiv: 1907.07504 · v1 · pith:ROITMKBGnew · submitted 2019-07-17 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-24 20:18 UTC · model grok-4.3

keywords Bayesian inference · deep learning · subspace methods · uncertainty estimation · neural networks · variational inference · slice sampling · SGD trajectory

The pith

Bayesian model averaging over posteriors in low-dimensional subspaces of neural network parameters yields accurate predictions with well-calibrated uncertainty.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to build low-dimensional subspaces of neural network parameter space, for example by taking the leading principal components of an SGD optimization trajectory. These subspaces hold many different high-performing models, so standard Bayesian tools such as elliptical slice sampling and variational inference can be run inside them even though the same tools fail in the full high-dimensional space. Bayesian model averaging over the resulting posterior then produces both accurate point predictions and reliable uncertainty estimates on regression and image-classification tasks.

Core claim

Low-dimensional subspaces of parameter space, such as the first principal components of the SGD trajectory, contain diverse sets of high performing models; elliptical slice sampling and variational inference can therefore be applied inside these subspaces, and Bayesian model averaging over the induced posterior produces accurate predictions and well calibrated predictive uncertainty for both regression and image classification.

What carries the argument

Low-dimensional subspaces of parameter space (e.g., leading principal components of the SGD trajectory) that contain diverse high-performing models.
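
As a rough illustration of this construction, the NumPy sketch below builds a PCA subspace from a matrix of weight snapshots and maps subspace coordinates back to full weight vectors. The function names (`pca_subspace`, `to_weights`), the singular-value scaling of the directions, and the synthetic trajectory are assumptions made for illustration; the paper's exact procedure (snapshot schedule, shift vector, scaling) may differ.

```python
import numpy as np


def pca_subspace(snapshots, rank):
    """Return (w_mean, P) for a rank-`rank` affine subspace.

    snapshots: array of shape (T, D), one flattened weight vector per row.
    P has shape (D, rank); its columns are principal directions of the
    trajectory, scaled by the standard deviation along each direction
    (the scaling is an illustrative choice, not taken from the paper).
    """
    snapshots = np.asarray(snapshots, dtype=float)
    w_mean = snapshots.mean(axis=0)
    deviations = snapshots - w_mean
    # Thin SVD: rows of vt are orthonormal principal directions.
    _, s, vt = np.linalg.svd(deviations, full_matrices=False)
    scale = s[:rank] / np.sqrt(max(len(snapshots) - 1, 1))
    P = (vt[:rank] * scale[:, None]).T
    return w_mean, P


def to_weights(w_mean, P, z):
    """Map subspace coordinates z (length rank) to a full weight vector."""
    return w_mean + P @ z


# Synthetic stand-in for an SGD trajectory: T snapshots of D weights that
# lie in a K-dimensional affine subspace (hypothetical data).
rng = np.random.default_rng(0)
T, D, K = 20, 50, 3
basis = rng.normal(size=(K, D))
trajectory = 1.0 + rng.normal(size=(T, K)) @ basis

w_mean, P = pca_subspace(trajectory, rank=K)
w_sample = to_weights(w_mean, P, rng.normal(size=K))
```

Inference then only has to handle the `K` coordinates in `z` rather than all `D` weights, which is what makes samplers such as elliptical slice sampling practical.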

If this is right

Elliptical slice sampling becomes practical for deep networks when restricted to the subspace.
Variational inference likewise becomes tractable inside the same low-dimensional region.
The resulting posterior yields both accurate predictions and well-calibrated uncertainty on regression problems.
The same posterior yields accurate predictions and well-calibrated uncertainty on image classification problems.
Bayesian model averaging over the subspace posterior is feasible without requiring full-space inference.
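
To make the sampling and averaging steps concrete, here is a minimal sketch of elliptical slice sampling over subspace coordinates followed by Bayesian model averaging. It assumes a toy linear model, a Gaussian likelihood with known noise level, and an isotropic Gaussian prior on `z`; the shift vector `w_mean`, projection `P`, and all data are synthetic placeholders, not the paper's setup.

```python
import numpy as np


def elliptical_slice_step(z, log_lik, prior_std, rng):
    """One elliptical slice sampling update for a N(0, prior_std^2 I) prior."""
    nu = prior_std * rng.normal(size=z.shape)   # auxiliary draw from the prior
    log_y = log_lik(z) - rng.exponential()      # slice height: log_lik + log(U)
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        proposal = z * np.cos(theta) + nu * np.sin(theta)
        if log_lik(proposal) > log_y:
            return proposal
        # Shrink the angle bracket towards theta = 0 (the current state).
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)


# Toy problem (all values hypothetical): a linear model whose weights are
# restricted to the affine subspace w = w_mean + P z.
rng = np.random.default_rng(0)
D, K, N, noise = 10, 2, 40, 0.1
w_mean = rng.normal(size=D)
P = rng.normal(size=(D, K))
X = rng.normal(size=(N, D))
z_true = np.array([0.5, -1.0])
y = X @ (w_mean + P @ z_true) + noise * rng.normal(size=N)


def log_lik(z):
    resid = y - X @ (w_mean + P @ z)
    return -0.5 * np.sum(resid ** 2) / noise ** 2


z = np.zeros(K)
samples = []
for step in range(700):
    z = elliptical_slice_step(z, log_lik, prior_std=2.0, rng=rng)
    if step >= 200:                             # discard burn-in
        samples.append(z)
samples = np.array(samples)

# Bayesian model averaging: average predictions over posterior samples.
preds = np.stack([X @ (w_mean + P @ s) for s in samples])
bma_mean = preds.mean(axis=0)
bma_std = preds.std(axis=0)
```

Each likelihood evaluation still runs the full model, but the sampler only explores `K` dimensions, which is the source of the tractability the list above refers to.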

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same subspace construction might be applied to other optimization paths such as those produced by Adam or natural-gradient methods.
Subspace posteriors could be combined with existing approximate inference techniques that already scale to moderate dimensions.
If the subspaces remain effective across different architectures, the method would reduce the computational barrier to Bayesian deep learning in practice.

Load-bearing premise

Low-dimensional subspaces of parameter space contain diverse sets of high-performing models.

What would settle it

If Bayesian model averaging performed inside the proposed subspaces produces predictions or uncertainty estimates that are no better than, or worse than, standard point-estimate training on the same regression and classification benchmarks, the central claim would be falsified.

Original abstract

Bayesian inference was once a gold standard for learning with neural networks, providing accurate full predictive distributions and well calibrated uncertainty. However, scaling Bayesian inference techniques to deep neural networks is challenging due to the high dimensionality of the parameter space. In this paper, we construct low-dimensional subspaces of parameter space, such as the first principal components of the stochastic gradient descent (SGD) trajectory, which contain diverse sets of high performing models. In these subspaces, we are able to apply elliptical slice sampling and variational inference, which struggle in the full parameter space. We show that Bayesian model averaging over the induced posterior in these subspaces produces accurate predictions and well calibrated predictive uncertainty for both regression and image classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that Bayesian inference can run in low-dimensional subspaces built from SGD trajectories and produce calibrated uncertainty; the key question is whether those subspaces actually hold enough predictive diversity.

The main takeaway is that this work constructs low-dimensional subspaces from the top principal components of the SGD trajectory and runs elliptical slice sampling or variational inference inside them. This enables Bayesian model averaging for networks where full-space methods fail, and the experiments on regression and image classification report accurate predictions with good uncertainty calibration. That construction is the concrete new piece relative to standard full-parameter Bayesian approaches, and the experiments give some evidence that the restricted posterior works in practice on the tasks the authors tried.

The soft spot is exactly the one the stress-test flags: the SGD trajectory is an optimization path, so its principal components reflect training dynamics rather than the directions where the posterior has mass or where models disagree on outputs. If the models inside the subspace are too similar in their predictions, the averaging step adds little beyond a point estimate. The abstract does not show checks for predictive disagreement or comparisons against full-space posteriors on smaller models, so it is unclear how well the induced posterior approximates the true one.

The math for building the subspace is simple and reproducible, and the citations track the relevant Bayesian deep learning literature without obvious gaps. The paper is aimed at people who need practical uncertainty estimates for deployed neural nets. A reader working on scalable Bayesian methods would get a usable technique and some supporting numbers from it. It deserves peer review because the problem is real and the method is specific enough to test and improve.

Referee Report

2 major / 2 minor

Summary. The paper claims that low-dimensional subspaces of parameter space, such as the first principal components of the SGD trajectory, contain diverse sets of high-performing models. Within these subspaces, standard Bayesian methods (elliptical slice sampling and variational inference) become tractable, and Bayesian model averaging over the induced posterior yields accurate predictions together with well-calibrated uncertainty on both regression and image-classification tasks.

Significance. If the central claim is substantiated, the work would be a useful practical contribution to scalable Bayesian deep learning: it shows how the geometry of an optimization trajectory can be exploited to reduce an otherwise intractable high-dimensional inference problem while preserving the benefits of full posterior predictive distributions.

major comments (2)

[§3] Subspace construction: the claim that the first principal components of the SGD trajectory contain diverse high-performing models is load-bearing for the entire argument, yet the manuscript provides no direct evidence that models sampled inside the subspace produce meaningfully different predictions (e.g., no analysis of output disagreement or predictive variance across the subspace). Without this, it is unclear whether the restricted posterior approximates the full posterior or merely averages similar functions.
[§4.2] Image-classification experiments: the reported improvements in calibration are presented without error bars or repeated runs; a single trajectory per dataset leaves open the possibility that the observed calibration is an artifact of a particular optimization path rather than a general property of the subspace method.
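
For readers who want to see what the requested error bars would involve, the sketch below computes expected calibration error (ECE) for one run and then aggregates a metric over several runs. ECE is one common calibration measure; whether it matches the metrics used in the paper is an assumption, and the function names and the 10-bin default are illustrative choices.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins.

    probs: (N, C) predicted class probabilities; labels: (N,) integer classes.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            # Gap between accuracy and mean confidence, weighted by bin mass.
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)


def mean_and_std_over_runs(values):
    """Mean and sample standard deviation of one metric across seeds."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))


# Hypothetical example: an overconfident classifier on four points.
demo_probs = [[0.9, 0.1]] * 4
demo_labels = [0, 0, 1, 1]
demo_ece = expected_calibration_error(demo_probs, demo_labels)
```

Calling `mean_and_std_over_runs` on the per-seed ECE values from independent trajectories would give the error bars the comment asks for.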

minor comments (2)

[§3.1] The mapping from full parameter vector to subspace coordinates is described in prose but never written as an explicit linear projection; adding the corresponding matrix equation would remove ambiguity.
[Figure 2] The caption does not state the number of samples drawn from each method; this detail is needed to interpret the plotted uncertainty bands.
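
The explicit projection requested in the first minor comment could plausibly take the following form. This is a sketch under assumed notation, not an equation quoted from the manuscript: $\hat{w}$ denotes a shift vector (for example the mean of the trajectory snapshots), $v_1, \dots, v_K$ the chosen basis directions, $D$ the number of network weights, and $\mathcal{D}$ the training data.

```latex
% Subspace parameterization: K coordinates z determine all D weights.
\[
  w = \hat{w} + P z,
  \qquad
  P = [\, v_1, \dots, v_K \,] \in \mathbb{R}^{D \times K},
  \quad
  z \in \mathbb{R}^{K}.
\]
% Least-squares map from a full weight vector back to subspace coordinates.
\[
  z = (P^{\top} P)^{-1} P^{\top} (w - \hat{w}).
\]
% Posterior over the coordinates z induced by a prior p(z).
\[
  p(z \mid \mathcal{D}) \;\propto\; p(\mathcal{D} \mid \hat{w} + P z)\, p(z).
\]
```

When the columns of $P$ are orthonormal, the second equation reduces to $z = P^{\top}(w - \hat{w})$.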

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will incorporate the suggested improvements in the revised manuscript.

Referee: [§3] Subspace construction: the claim that the first principal components of the SGD trajectory contain diverse sets of high-performing models is load-bearing for the entire argument, yet the manuscript provides no direct evidence that models sampled inside the subspace produce meaningfully different predictions (e.g., no analysis of output disagreement or predictive variance across the subspace). Without this, it is unclear whether the restricted posterior approximates the full posterior or merely averages similar functions.

Authors: We agree that explicit evidence of predictive diversity within the subspace would strengthen the central claim. While the reported gains in accuracy and calibration relative to point estimates provide indirect support, we will add a direct analysis in the revision. Specifically, we will report metrics such as average pairwise predictive disagreement and the empirical variance of softmax outputs (or regression predictions) across posterior samples drawn from the subspace on held-out data.
Revision: yes

Referee: [§4.2] Image-classification experiments: the reported improvements in calibration are presented without error bars or repeated runs; a single trajectory per dataset leaves open the possibility that the observed calibration is an artifact of a particular optimization path rather than a general property of the subspace method.

Authors: We acknowledge that results from a single SGD trajectory per dataset limit the strength of the conclusions. In the revised manuscript we will rerun the image-classification experiments using multiple independent optimization trajectories (different random seeds) and report means and standard deviations for all calibration and accuracy metrics.
Revision: yes
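
The diversity metrics named in the first response can be sketched in a few lines. The code below is an illustrative reading of "average pairwise predictive disagreement" and "empirical variance of softmax outputs", assuming predictions are stored as an array of shape (samples, data points, classes); the function names and the demo array are hypothetical.

```python
import numpy as np


def pairwise_disagreement(probs):
    """Average fraction of points on which two posterior samples disagree.

    probs: (S, N, C) class probabilities from S posterior samples.
    """
    labels = np.asarray(probs).argmax(axis=2)   # (S, N) hard predictions
    n_samples = labels.shape[0]
    if n_samples < 2:
        raise ValueError("need at least two posterior samples")
    rates = [
        np.mean(labels[i] != labels[j])
        for i in range(n_samples)
        for j in range(i + 1, n_samples)
    ]
    return float(np.mean(rates))


def mean_predictive_variance(probs):
    """Variance of each probability across samples, averaged over points and classes."""
    probs = np.asarray(probs, dtype=float)
    return float(probs.var(axis=0).mean())


# Hypothetical predictions: two posterior samples, two data points, two classes.
demo = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.4, 0.6], [0.3, 0.7]],
])
demo_disagreement = pairwise_disagreement(demo)
demo_variance = mean_predictive_variance(demo)
```

Values near zero for both quantities would indicate that the subspace posterior is averaging nearly identical functions, which is the concern raised in the first major comment.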

Circularity Check

0 steps flagged

No circularity: empirical validation of subspace BMA is independent of construction

The paper constructs low-dimensional subspaces from the SGD trajectory (first principal components) and performs elliptical slice sampling or variational inference within them, then reports that BMA over the induced posterior yields accurate predictions and calibrated uncertainty on regression and image classification tasks. These performance claims are evaluated on held-out test data and do not reduce by construction to the subspace definition or any fitted parameter; the subspace is an input chosen for computational tractability, and the predictive results are an external empirical outcome. No self-citation chains, self-definitional steps, or renamed fitted quantities appear in the provided abstract or described method. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Low Rank Based Subspace Inference for the Laplace Approximation of Bayesian Neural Networks
cs.LG · 2025-02 · unverdicted · novelty 6.0

Derives an optimal low-rank subspace for the Laplace approximation in Bayesian neural networks, provides a scalable variant with stronger performance, and proposes a new comparison metric.