Subspace Inference for Bayesian Deep Learning
Pith reviewed 2026-05-24 20:18 UTC · model grok-4.3
The pith
Bayesian model averaging over posteriors in low-dimensional subspaces of neural network parameters yields accurate predictions with well-calibrated uncertainty.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Low-dimensional subspaces of parameter space, such as the first principal components of the SGD trajectory, contain diverse sets of high-performing models. Elliptical slice sampling and variational inference can therefore be applied inside these subspaces, and Bayesian model averaging over the induced posterior produces accurate predictions and well-calibrated predictive uncertainty for both regression and image classification.
What carries the argument
Low-dimensional subspaces of parameter space (e.g., leading principal components of the SGD trajectory) that contain diverse high-performing models.
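A minimal sketch of how such a subspace could be built from saved weight snapshots, assuming each snapshot is a flattened parameter vector. The function names (pca_subspace, to_weights, to_coords), the choice of the snapshot mean as the shift vector, and the orthonormal (unscaled) basis are illustrative assumptions, not details confirmed by the source; the trajectory here is synthetic.

```python
import numpy as np

def pca_subspace(snapshots, rank):
    """Fit an affine subspace to weight snapshots (one flattened vector per row)."""
    snapshots = np.asarray(snapshots, dtype=float)
    w_hat = snapshots.mean(axis=0)          # shift vector: mean of the snapshots (assumption)
    deviations = snapshots - w_hat          # shape (T, d)
    # Thin SVD; rows of vt are orthonormal principal directions of the deviations.
    _, _, vt = np.linalg.svd(deviations, full_matrices=False)
    P = vt[:rank].T                         # shape (d, rank), orthonormal columns
    return w_hat, P

def to_weights(z, w_hat, P):
    """Map subspace coordinates z (rank,) to a full weight vector (d,)."""
    return w_hat + P @ z

def to_coords(w, w_hat, P):
    """Orthogonal projection of w onto the subspace, in subspace coordinates."""
    return P.T @ (w - w_hat)

# Synthetic stand-in for an SGD trajectory: 40 snapshots of a 50-parameter model
# that move mostly within a hidden 3-dimensional plane.
rng = np.random.default_rng(0)
basis = rng.standard_normal((3, 50))
snapshots = rng.standard_normal((40, 3)) @ basis + 0.01 * rng.standard_normal((40, 50))

w_hat, P = pca_subspace(snapshots, rank=3)
z = np.array([0.5, -1.0, 2.0])
w = to_weights(z, w_hat, P)
z_back = to_coords(w, w_hat, P)             # recovers z because P has orthonormal columns
```

Keeping the basis orthonormal makes the round trip between weights and coordinates a plain matrix product; a different scaling of the directions would only change the prior one places on z.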
If this is right
- Elliptical slice sampling becomes practical for deep networks when restricted to the subspace.
- Variational inference likewise becomes tractable inside the same low-dimensional region.
- The resulting posterior yields both accurate predictions and well-calibrated uncertainty on regression problems.
- The same posterior yields accurate predictions and well-calibrated uncertainty on image classification problems.
- Bayesian model averaging over the subspace posterior is feasible without requiring full-space inference.
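To illustrate why sampling becomes practical once the state is a handful of subspace coordinates, here is a hedged sketch of one elliptical slice sampling transition under a zero-mean isotropic Gaussian prior. The prior scale, the function name elliptical_slice_step, and the toy Gaussian log-likelihood are assumptions for illustration; in the setting described above, the log-likelihood would evaluate the network at the weights corresponding to z.

```python
import numpy as np

def elliptical_slice_step(z, log_lik, prior_std, rng):
    """One ESS transition for a N(0, prior_std^2 I) prior over coordinates z."""
    nu = prior_std * rng.standard_normal(z.shape)      # auxiliary prior draw defining the ellipse
    log_y = log_lik(z) + np.log(1.0 - rng.uniform())   # slice height; 1 - u lies in (0, 1]
    theta = rng.uniform(0.0, 2.0 * np.pi)
    lo, hi = theta - 2.0 * np.pi, theta
    while True:
        proposal = z * np.cos(theta) + nu * np.sin(theta)
        if log_lik(proposal) > log_y:
            return proposal
        # Rejected: shrink the angle bracket toward theta = 0 (the current state).
        if theta < 0.0:
            lo = theta
        else:
            hi = theta
        theta = rng.uniform(lo, hi)

# Toy check: unit Gaussian prior, unit-variance Gaussian likelihood centred at `target`.
# The analytic posterior mean is target / 2.
target = np.array([1.0, -1.0])

def toy_log_lik(z):
    return -0.5 * np.sum((target - z) ** 2)

rng = np.random.default_rng(0)
z = np.zeros(2)
chain = []
for _ in range(5000):
    z = elliptical_slice_step(z, toy_log_lik, 1.0, rng)
    chain.append(z)
chain = np.array(chain[500:])                          # drop a short burn-in
post_mean = chain.mean(axis=0)
```

The sampler has no step size to tune, but each rejected angle costs one likelihood evaluation, which is one reason it pairs naturally with a low-dimensional state.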
Where Pith is reading between the lines
- The same subspace construction might be applied to other optimization paths such as those produced by Adam or natural-gradient methods.
- Subspace posteriors could be combined with existing approximate inference techniques that already scale to moderate dimensions.
- If the subspaces remain effective across different architectures, the method would reduce the computational barrier to Bayesian deep learning in practice.
Load-bearing premise
Low-dimensional subspaces of parameter space contain diverse sets of high-performing models.
What would settle it
If Bayesian model averaging inside the proposed subspaces produces predictions or uncertainty estimates that are no better than those from standard point-estimate training on the same regression and classification benchmarks, the central claim would be falsified.
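One way such a comparison could be quantified for classification is expected calibration error (ECE). The sketch below is an assumption about a reasonable metric, not the metric the source reports: it bins predictions by confidence and averages the gap between accuracy and confidence, weighted by bin size. The function name and the bin count are illustrative.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins; probs has shape (N, C)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap      # weight by the fraction of points in the bin
    return ece

# Calibrated toy case: 80% confidence, 4 of 5 predictions correct.
calibrated = expected_calibration_error(np.tile([0.8, 0.2], (5, 1)), [0, 0, 0, 0, 1])
# Overconfident toy case: 100% confidence, 2 of 4 predictions correct.
overconfident = expected_calibration_error(np.tile([1.0, 0.0], (4, 1)), [0, 0, 1, 1])
```

Computing the same quantity for the point estimate and for the averaged predictions, on the same held-out set, gives a direct reading of whether averaging helped calibration.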
Original abstract
Bayesian inference was once a gold standard for learning with neural networks, providing accurate full predictive distributions and well calibrated uncertainty. However, scaling Bayesian inference techniques to deep neural networks is challenging due to the high dimensionality of the parameter space. In this paper, we construct low-dimensional subspaces of parameter space, such as the first principal components of the stochastic gradient descent (SGD) trajectory, which contain diverse sets of high performing models. In these subspaces, we are able to apply elliptical slice sampling and variational inference, which struggle in the full parameter space. We show that Bayesian model averaging over the induced posterior in these subspaces produces accurate predictions and well calibrated predictive uncertainty for both regression and image classification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that low-dimensional subspaces of parameter space, such as the first principal components of the SGD trajectory, contain diverse sets of high-performing models. Within these subspaces, standard Bayesian methods (elliptical slice sampling and variational inference) become tractable, and Bayesian model averaging over the induced posterior yields accurate predictions together with well-calibrated uncertainty on both regression and image-classification tasks.
Significance. If the central claim is substantiated, the work would be a useful practical contribution to scalable Bayesian deep learning: it shows how the geometry of an optimization trajectory can be exploited to reduce an otherwise intractable high-dimensional inference problem while preserving the benefits of full posterior predictive distributions.
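As a rough illustration of the posterior predictive mentioned above, Bayesian model averaging can be approximated by averaging class probabilities over posterior samples of the subspace coordinates. Everything named here (bma_predict, linear_logits, the random basis, and standard-normal draws standing in for posterior samples) is a hypothetical stand-in chosen to keep the sketch self-contained.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)   # stabilise the exponentials
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def bma_predict(z_samples, w_hat, P, predict_logits, x):
    """Monte Carlo model average: mean of per-sample class probabilities."""
    per_sample = [softmax(predict_logits(w_hat + P @ z, x)) for z in z_samples]
    return np.mean(per_sample, axis=0)

def linear_logits(w, x):
    """Toy model standing in for a network: logits = x @ W, W unflattened from w."""
    return x @ w.reshape(x.shape[1], -1)

rng = np.random.default_rng(0)
n_features, n_classes, rank = 4, 3, 2
w_hat = rng.standard_normal(n_features * n_classes)
P, _ = np.linalg.qr(rng.standard_normal((n_features * n_classes, rank)))
z_samples = rng.standard_normal((20, rank))   # placeholder for posterior samples
x = rng.standard_normal((5, n_features))
bma_probs = bma_predict(z_samples, w_hat, P, linear_logits, x)
```

Averaging probabilities rather than logits or weights is what makes this a model average: each sample contributes its own predictive distribution.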
major comments (2)
- [§3] §3 (Subspace construction): the claim that the first principal components of the SGD trajectory contain diverse high-performing models is load-bearing for the entire argument, yet the manuscript provides no direct evidence that models sampled inside the subspace produce meaningfully different predictions (e.g., no analysis of output disagreement or predictive variance across the subspace). Without this, it is unclear whether the restricted posterior approximates the full posterior or merely averages similar functions.
- [§4.2] §4.2 (Image-classification experiments): the reported improvements in calibration are presented without error bars or repeated runs; a single trajectory per dataset leaves open the possibility that the observed calibration is an artifact of a particular optimization path rather than a general property of the subspace method.
minor comments (2)
- [§3.1] The mapping from full parameter vector to subspace coordinates is described in prose but never written as an explicit linear projection; adding the corresponding matrix equation would remove ambiguity.
- [Figure 2] Figure 2 caption does not state the number of samples drawn from each method; this detail is needed to interpret the plotted uncertainty bands.
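For the first minor comment, the explicit linear projection might be written along the following lines. The symbols are assumptions introduced for illustration: \hat{w} is a shift vector (for example, the mean of the SGD iterates), v_1, ..., v_K are the basis directions, d is the number of parameters, and z holds the subspace coordinates.

```latex
w = \hat{w} + P z,
\qquad
P = \begin{bmatrix} v_1 & \cdots & v_K \end{bmatrix} \in \mathbb{R}^{d \times K},
\qquad
z \in \mathbb{R}^{K}, \quad K \ll d .

% Coordinates of the orthogonal projection of an arbitrary w onto the subspace:
z = (P^{\top} P)^{-1} P^{\top} (w - \hat{w}),
\qquad\text{which reduces to } z = P^{\top}(w - \hat{w}) \text{ when } P^{\top} P = I_K .
```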
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point-by-point below and will incorporate the suggested improvements in the revised manuscript.
- Referee: [§3] §3 (Subspace construction): the claim that the first principal components of the SGD trajectory contain diverse sets of high-performing models is load-bearing for the entire argument, yet the manuscript provides no direct evidence that models sampled inside the subspace produce meaningfully different predictions (e.g., no analysis of output disagreement or predictive variance across the subspace). Without this, it is unclear whether the restricted posterior approximates the full posterior or merely averages similar functions.
  Authors: We agree that explicit evidence of predictive diversity within the subspace would strengthen the central claim. While the reported gains in accuracy and calibration relative to point estimates provide indirect support, we will add a direct analysis in the revision. Specifically, we will report metrics such as average pairwise predictive disagreement and the empirical variance of softmax outputs (or regression predictions) across posterior samples drawn from the subspace on held-out data.
  Revision: yes
- Referee: [§4.2] §4.2 (Image-classification experiments): the reported improvements in calibration are presented without error bars or repeated runs; a single trajectory per dataset leaves open the possibility that the observed calibration is an artifact of a particular optimization path rather than a general property of the subspace method.
  Authors: We acknowledge that results from a single SGD trajectory per dataset limit the strength of the conclusions. In the revised manuscript we will rerun the image-classification experiments using multiple independent optimization trajectories (different random seeds) and report means and standard deviations for all calibration and accuracy metrics.
  Revision: yes
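The diversity metrics named in the first response could be computed roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, and the input is taken to be an array of class probabilities with shape (samples, inputs, classes) gathered from posterior samples on held-out data.

```python
import numpy as np

def pairwise_disagreement(probs):
    """Mean fraction of inputs on which two posterior samples predict different classes."""
    labels = probs.argmax(axis=-1)                 # shape (S, N)
    n_samples = labels.shape[0]
    total, pairs = 0.0, 0
    for i in range(n_samples):
        for j in range(i + 1, n_samples):
            total += np.mean(labels[i] != labels[j])
            pairs += 1
    return total / pairs

def predictive_variance(probs):
    """Variance of class probabilities across samples, averaged over inputs and classes."""
    return probs.var(axis=0).mean()

# Three identical samples: no diversity at all.
one_sample = np.tile([0.9, 0.1], (4, 1))           # 4 inputs, 2 classes
identical = np.stack([one_sample] * 3)
# Two samples that disagree on exactly half of the 4 inputs.
flipped = one_sample.copy()
flipped[:2] = [0.1, 0.9]
half_split = np.stack([one_sample, flipped])

d_identical = pairwise_disagreement(identical)
v_identical = predictive_variance(identical)
d_half = pairwise_disagreement(half_split)
```

For the second response, per-seed values of any such metric can be summarized with np.mean and np.std across the independent runs.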
Circularity Check
No circularity: empirical validation of subspace BMA is independent of construction
The paper constructs low-dimensional subspaces from the SGD trajectory (first principal components) and performs elliptical slice sampling or variational inference within them, then reports that BMA over the induced posterior yields accurate predictions and calibrated uncertainty on regression and image classification tasks. These performance claims are evaluated on held-out test data and do not reduce by construction to the subspace definition or any fitted parameter; the subspace is an input chosen for computational tractability, and the predictive results are an external empirical outcome. No self-citation chains, self-definitional steps, or renamed fitted quantities appear in the provided abstract or described method. The derivation chain is therefore self-contained against external benchmarks.
Forward citations
Cited by 1 Pith paper
- Low Rank Based Subspace Inference for the Laplace Approximation of Bayesian Neural Networks
  Derives an optimal low-rank subspace for the Laplace approximation in Bayesian neural networks, provides a scalable, better-performing variant, and introduces a new comparison metric.